The Creation and Detection of Deepfakes: A Survey
YISROEL MIRSKY, Georgia Institute of Technology and Ben-Gurion University
WENKE LEE, Georgia Institute of Technology
Generative deep learning algorithms have progressed to a point where it is difficult to tell the difference between
what is real and what is fake. In 2018, it was discovered how easy it is to use this technology for unethical
and malicious applications, such as the spread of misinformation, impersonation of political leaders, and the
defamation of innocent individuals. Since then, these 'deepfakes' have advanced significantly.
In this paper, we explore the creation and detection of deepfakes and provide an in-depth view of how these
architectures work. The purpose of this survey is to provide the reader with a deeper understanding of (1) how
deepfakes are created and detected, (2) the current trends and advancements in this domain, (3) the shortcomings
of the current defense solutions, and (4) the areas which require further research and attention.
CCS Concepts: • Security and privacy → Social engineering attacks; Human and societal aspects of security
and privacy; • Computing methodologies → Machine learning;
Additional Key Words and Phrases: Deepfake, Deep fake, reenactment, replacement, face swap, generative AI,
social engineering, impersonation
ACM Reference format:
Yisroel Mirsky and Wenke Lee. 2020. The Creation and Detection of Deepfakes: A Survey. ACM Comput. Surv.
1, 1, Article 1 (January 2020), 38 pages.
DOI: XX.XXXX/XXXXXXX.XXXXXXX
1 INTRODUCTION
A deepfake is content, generated by an artificial intelligence, that is authentic in the eyes of a human
being. The word deepfake is a combination of the words 'deep learning' and 'fake' and primarily
relates to content generated by an artificial neural network, a branch of machine learning.
The most common form of deepfakes involves the generation and manipulation of human imagery.
This technology has creative and productive applications. For example, realistic video dubbing of
foreign films,1 education through the reanimation of historical figures [90], and virtually trying on
clothes while shopping.2 There are also numerous online communities devoted to creating deepfake
memes for entertainment,3 such as music videos portraying the face of actor Nicolas Cage.
However, despite the positive applications of deepfakes, the technology is infamous for its unethical
and malicious aspects. At the end of 2017, a Reddit user by the name of 'deepfakes' was using
Corresponding Author
1 https://variety.com/2019/biz/news/ai-dubbing-david-beckham-multilingual-1203309213/
2 https://www.forbes.com/sites/forbestechcouncil/2019/05/21/gans-and-deepfakes-could-revolutionize-the-fashion-industry/
3 https://www.reddit.com/r/SFWdeepfakes/
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2020 ACM. 0360-0300/2020/1-ART1 $15.00
DOI: XX.XXXX/XXXXXXX.XXXXXXX
ACM Computing Surveys, Vol. 1, No. 1, Article 1. Publication date: January 2020.
arXiv:2004.11138v3 [cs.CV] 13 Sep 2020
1:2 Mirsky, et al.
Fig. 1. A deepfake information trust chart, plotting the intention to mislead against truth. Quadrant I, Hoax:
scams & fraud (trickery via spoofing, falsifying audit records, generating artwork), tampering of evidence
(medical, forensic, court), and harming credibility (revenge porn, political sabotage via generated videos or
articles). Quadrant II, Propaganda: misdirection (generated discourse to amplify events/facts), political warfare
(tone change of articles, content loosely based on facts, conspiracy), and corruption (increased xenophobia).
Quadrant III, Entertainment: altering published movies (comedy, satire) and editing & special effects (generating
actors in movies). Quadrant IV, Trusted: art & demonstration (animating dead characters, generated portraits,
technology demos) and authentic content (credible multimedia/data).
Fig. 2. The difference between adversarial machine learning and deepfakes: samples created by machines
to fool humans (deepfakes), machines (adversarial samples), or both. Examples: humans (entertainment,
impersonation, art fraud); machines (hiding a stop sign, evading face recognition); both (tampering medical
scans, malware evasion).
deep learning to swap faces of celebrities into pornographic videos, and was posting them online.4
The discovery caused a media frenzy and a large number of new deepfake videos began to emerge
thereafter. In 2018, BuzzFeed released a deepfake video of former president Barack Obama giving
a talk on the subject. The video was made using the Reddit user's software (FakeApp), and raised
concerns over identity theft, impersonation, and the spread of misinformation on social media. Fig. 1
presents an information trust chart for deepfakes, inspired by [49].
Following these events, the subject of deepfakes gained traction in the academic community, and
the technology has been rapidly advancing over the last few years. Since 2017, the number of papers
published on the subject rose from 3 to over 250 (2018-20).
To understand where the threats are moving and how to mitigate them, we need a clear view of the
technology's challenges, limitations, capabilities, and trajectory. Unfortunately, to the best of our
knowledge, there are no other works which present the techniques, advancements, and challenges
in a technical and encompassing way. Therefore, the goals of this paper are (1) to provide the reader
with an understanding of how modern deepfakes are created and detected, (2) to inform the reader of
the recent advances, trends, and challenges in deepfake research, (3) to serve as a guide to the design
of deepfake architectures, and (4) to identify the current status of the attacker-defender game, the
attacker's next move, and future work that may help give the defender a leading edge.
We achieve these goals through an overview of human visual deepfakes (Section 2), followed by a
technical background which identifies the technology's basic building blocks and challenges (Section 3).
We then provide a chronological and systematic review for each category of deepfake, and provide the
networks' schematics to give the reader a deeper understanding of the various approaches (Sections
4 and 5). Finally, after reviewing the countermeasures (Section 6), we discuss their weaknesses, note
the current limitations of deepfakes, suggest alternative research, consider the adversary's next steps,
and raise awareness of the spread of deepfakes to other domains (Section 7).
Scope. In this survey we will focus on deepfakes pertaining to the human face and body. We will
not be discussing the synthesis of new faces or the editing of facial features because they do not
have a clear attack goal associated with them. In Section 7.3 we will discuss deepfakes with a much
4 https://www.vice.com/en_us/article/gydydm/gal-gadot-fake-ai-porn
broader scope, note the future trends, and exemplify how deepfakes have spread to other domains
and media such as forensics, finance, and healthcare.
We note to the reader that deepfakes should not be confused with adversarial machine learning,
which is the subject of fooling machine learning algorithms with maliciously crafted inputs (Fig. 2).
The difference is that for deepfakes, the objective of the generated content is to fool a human
and not a machine.
2 OVERVIEW & ATTACK MODELS
We define a deepfake as
"Believable media generated by a deep neural network"
In the context of human visuals, we identify four categories: reenactment, replacement, editing, and
synthesis. Fig. 3 illustrates some examples of facial deepfakes in each of these categories and their
sub-types. Throughout this paper we denote s and t as the source and the target identities. We also
denote x_s and x_t as images of these identities, and x_g as the deepfake generated from s and t.
2.1 Reenactment
A reenactment deepfake is where x_s is used to drive the expression, mouth, gaze, pose, or body of x_t:
Expression reenactment is where x_s drives the expression of x_t. It is the most common form of
reenactment since these technologies often drive the target's mouth and pose as well, providing a
wide range of flexibility. Benign uses are found in the movie and video game industry, where
the performances of actors are tweaked in post, and in educational media, where historical
figures are reenacted.
Mouth reenactment, also known as 'dubbing', is where the mouth of x_t is driven by that of x_s, or
by an audio input a_s containing speech. Benign uses of the technology include realistic voice
dubbing into another language and editing.
Gaze reenactment is where the direction of x_t's eyes, and the position of the eyelids, are driven by those
of x_s. This is used to improve photographs or to automatically maintain eye contact during
video interviews [45].
Pose reenactment is where the head position of x_t is driven by x_s. This technology has primarily
been used for face frontalization of individuals in security footage, and as a means for
improving facial recognition software [159].
Body reenactment, a.k.a. pose transfer and human pose synthesis, is similar to the facial reenactments
listed above except that it is the pose of x_t's body being driven.
Fig. 3. Examples of reenactment, replacement, editing, and synthesis deepfakes of the human face. Facial
reenactment transfers the gaze, mouth, expression, or pose of a source to a target; face replacement performs
a complete transfer or a swap of identity; face editing alters attributes such as hair, articles of clothing, age,
beauty, or ethnicity; face synthesis generates a new face.
The Attack Model. Reenactment deepfakes give attackers the ability to impersonate an identity,
controlling what he or she says or does. This enables an attacker to perform acts of defamation, discredit
individuals, spread misinformation, and tamper with evidence. For example, an attacker can impersonate
t to gain the trust of a colleague, friend, or family member as a means to gain access to money,
network infrastructure, or some other asset. An attacker can also generate embarrassing content of t
for blackmailing purposes, or generate content to affect the public's opinion of an individual or political
leader. The technology can also be used to tamper with surveillance footage or some other archival imagery
in an attempt to plant false evidence in a trial. Finally, the attack can either take place online (e.g.,
impersonating someone in a real-time conversation) or offline (e.g., fake media spread on the Internet).
2.2 Replacement
A replacement deepfake is where the content of x_t is replaced with that of x_s, preserving the identity
of s.
Transfer is where the content of x_t is replaced with that of x_s. A common type of transfer is facial
transfer, used in the fashion industry to visualize an individual in different outfits.
Swap is where the content transferred to x_t from x_s is driven by x_t. The most popular type of swap
replacement is 'face swap', often used to generate memes or satirical content by swapping the
identity of an actor with that of a famous individual. Another benign use for face swapping
includes the anonymization of one's identity in public content, in place of blurring or pixelation.
The Attack Model. Replacement deepfakes are well-known for their harmful applications. For
example, revenge porn is where an attacker swaps a victim's face onto the body of a porn actress
to humiliate, defame, and blackmail the victim. Face replacement can also be used as a shortcut to
fully reenacting t by transferring t's face onto the body of a look-alike. This approach has been used
as a tool for disseminating political opinions in the past [137].
2.3 Editing & Synthesis
An editing deepfake is where the attributes of x_t are added, altered, or removed. Some examples
include changing a target's clothes, facial hair, age, weight, beauty, and ethnicity. Apps such as
FaceApp enable users to alter their appearance for entertainment and easy editing of multimedia.
The same process can be used by an attacker to build a false persona for misleading others. For
example, a sick leader can be made to look healthy [67], and child or sex predators can change their
age and gender to build dynamic profiles online. A known unethical use of editing deepfakes is the
removal of a victim's clothes for humiliation or entertainment [134].
Synthesis is where the deepfake x_g is created with no target as a basis. Human face and body
synthesis techniques such as [78] (used in Fig. 3) can create royalty-free stock footage or generate
characters for movies and games. However, similar to editing deepfakes, synthesis can also be used
to create fake personas online.
Although human image editing and synthesis are active research topics, reenactment and replacement
deepfakes are the greatest concern because they give an attacker control over one's identity [12,
28, 66]. Therefore, in this survey we will be focusing on reenactment and replacement deepfakes.
3 TECHNICAL BACKGROUND
Although there are a wide variety of neural networks, most deepfakes are created using variations
or combinations of generative networks and encoder-decoder networks. In this section we provide a
brief introduction to these networks, how they are trained, and the notations which we will be using
throughout the paper.
3.1 Neural Networks
Neural networks are non-linear models for predicting or generating content based on an input. They
are made up of layers of neurons, where each layer is connected sequentially via synapses. The
synapses have associated weights which collectively define the concepts learned by the model. To
execute a network on an n-dimensional input x, a process known as forward-propagation is performed,
where x is propagated through each layer and an activation function is used to summarize a neuron's
output (e.g., the Sigmoid or ReLU function).
Concretely, let l^(i) denote the i-th layer in the network M, and let ||l^(i)|| denote the number of neurons
in l^(i). Let the total number of layers in M be denoted as L. The weights which connect l^(i) to
l^(i+1) are denoted as the ||l^(i)||-by-||l^(i+1)|| matrix W^(i) and the ||l^(i+1)||-dimensional bias vector b^(i). Finally,
we denote the collection of all parameters θ as the tuple θ ≡ (W, b), where W and b are the weights and
biases of each layer respectively. Let a^(i+1) denote the output (activation) of layer l^(i), obtained by computing
f(W^(i) · a^(i) + b^(i)), where f is often the Sigmoid or ReLU function. To execute a network on an
n-dimensional input x, a process known as forward-propagation is performed, where x is used to activate
l^(1), which activates l^(2), and so on until the activation of l^(L) produces the m-dimensional output y.
To summarize this process, we consider M a black box and denote its execution as M(x) = y. To
train M in a supervised setting, a dataset of paired samples of the form (x_i, y_i) is obtained and an
objective loss function L is defined. The loss function is used to generate a signal at the output of M,
which is back-propagated through M to find the errors of each weight. An optimization algorithm,
such as gradient descent (GD), is then used to update the weights over a number of epochs. The
function L is often a measure of error between the predicted output y′ and the ground truth y. As a result, the
network learns the function M(x_i) ≈ y_i and can be used to make predictions on unseen data.
Some deepfake networks use a technique called one-shot or few-shot learning, which enables
a pre-trained network to adapt to a new dataset X′ similar to the X on which it was trained. Two
common approaches for this are to (1) pass information on x′ ∈ X′ to the inner layers of M during the
feed-forward process, and (2) perform a few additional training iterations on a few samples from X′.
3.2 Loss Functions
In order to update the weights with an optimization algorithm, such as GD, the loss function must
be differentiable. There are various types of loss functions which can be applied in different ways
depending on the learning objective. For example, when training M as an n-class classifier, the
output of M would be the probability vector y ∈ R^n. To train M, we perform forward-propagation
to obtain y′ = M(x), compute the cross-entropy loss (L_CE) by comparing y′ to the ground-truth label
y, and then perform back-propagation to update the weights with the training signal. The loss
L_CE over the entire training set X is calculated as

    L_CE = − Σ_{i=1}^{|X|} Σ_{c=1}^{n} y_i[c] log(y′_i[c])    (1)

where y′_i[c] is the predicted probability of x_i belonging to the c-th class.
Other popular loss functions used in deepfake networks include the L1 and L2 norms, L_1 = ||x − x_g||_1
and L_2 = ||x − x_g||_2. However, L1 and L2 require paired images (e.g., of s and t with the same expression)
and perform poorly when there are large offsets between the images, such as different poses or facial
features. This often occurs in reenactment when x_t has a different pose than x_s, which is reflected
in x_g, and ultimately we'd like x_g to match the appearance of x_t.
One approach to comparing two unaligned images is to pass them through another network (a
perceptual model) and measure the difference between the layers' activations (feature maps). This
Fig. 4. Five basic neural network architectures used to create deepfakes: the encoder-decoder, the vanilla
GAN, pix2pix, CycleGAN, and the RNN. The lines indicate dataflows used during deployment (black) and
training (grey).
loss is called the perceptual loss (L_perc) and is described in [76] for image generation tasks. In the
creation of deepfakes, L_perc is often computed using a face recognition network such as VGGFace.
The intuition behind L_perc is that the feature maps (inner layer activations) of the perceptual model
act as a normalized representation of x in the context of how the model was trained. Therefore,
by measuring the distance between the feature maps of two different images, we are essentially
measuring their semantic difference (e.g., how similar the noses are to each other, and other finer
details). Similar to L_perc, there is a feature matching loss (L_FM) [133] which uses the last output
of a network. The idea behind L_FM is to consider the high-level semantics captured by the last layer
of the perceptual model (e.g., the general shape and textures of the head).
Another common loss is a type of content loss (L_C) [59] which is used to help the generator create
realistic features, based on the perspective of a perceptual model. In L_C, only x_g is passed through
the perceptual model and the difference between the network's feature maps is measured.
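The following sketch illustrates the idea behind L_perc under a strong simplification: a fixed random filter bank stands in for the pretrained perceptual model (in practice, a face recognition CNN such as VGGFace), so only the mechanics of comparing feature maps rather than raw pixels are shown:

```python
import numpy as np

# Illustrative perceptual-loss sketch. The "perceptual model" here is a fixed
# bank of random 3x3 convolution filters, a stand-in for a pretrained CNN's
# inner layers; the resulting numbers are for demonstration only.

rng = np.random.default_rng(1)
filters = rng.normal(size=(8, 3, 3))   # 8 fixed "perceptual" filters

def feature_maps(img):
    # Valid 2D convolution of an HxW image with each filter.
    H, W = img.shape
    out = np.empty((8, H - 2, W - 2))
    for k, f in enumerate(filters):
        for i in range(H - 2):
            for j in range(W - 2):
                out[k, i, j] = np.sum(img[i:i+3, j:j+3] * f)
    return out

def perceptual_loss(x, x_g):
    # L2 distance between feature maps rather than raw pixels.
    return np.mean((feature_maps(x) - feature_maps(x_g)) ** 2)

x   = rng.random((16, 16))                 # "real" image
x_g = x + 0.05 * rng.random((16, 16))      # slightly perturbed "generated" image
```

A content loss L_C follows the same mechanics, except that only x_g is passed through the perceptual model.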
3.3 Generative Neural Networks (for deepfakes)
Deepfakes are often created using combinations or variations of six different networks, five of which
are illustrated in Fig. 4.
Encoder-Decoder Networks (ED). An ED consists of at least two networks, an encoder En and a
decoder De. The ED has narrower layers towards its center so that when it is trained as
De(En(x)) = x_g, the network is forced to summarize the observed concepts. The summary
of x, given its distribution X, is En(x) = e, often referred to as an encoding or embedding,
and E = En(X) is referred to as the 'latent space'. Deepfake technologies often use multiple
encoders or decoders and manipulate the encodings to influence the output x_g. If the encoder
and decoder are symmetrical, and the network is trained with the objective De(En(x)) = x,
then the network is called an autoencoder, and the output is the reconstruction of x, denoted
x̂. Another special kind of ED is the variational autoencoder (VAE), where the encoder learns
the posterior distribution of the decoder given X. VAEs are better at generating content than
autoencoders because the concepts in the latent space are disentangled, and thus encodings
respond better to interpolation and modification.
Convolutional Neural Network (CNN). In contrast to a fully connected (dense) network, a CNN
learns pattern hierarchies in the data and is therefore much more efficient at handling imagery.
A convolutional layer in a CNN learns filters which are shifted over the input, forming an
abstract feature map as the output. Pooling layers are used to reduce the dimensionality as
the network gets deeper, and up-sampling layers are used to increase it. With convolutional,
pooling, and up-sampling layers, it is possible to build an ED CNN for imagery.
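A minimal dense (non-convolutional) sketch of the ED dataflow De(En(x)) = x_g, with untrained random weights, can make the bottleneck concrete; only the shapes are meaningful here:

```python
import numpy as np

# Encoder-decoder sketch: En compresses x into an embedding e, De generates
# x_g from e. The random weights are untrained, so x_g is not a meaningful
# reconstruction; the point is the narrow bottleneck in the dataflow.

rng = np.random.default_rng(2)
W_en = rng.normal(size=(32, 256))   # encoder: 256-dim input -> 32-dim embedding
W_de = rng.normal(size=(256, 32))   # decoder: 32-dim embedding -> 256-dim output

def En(x):
    return np.tanh(W_en @ x)        # embedding e in the latent space E

def De(e):
    return W_de @ e                 # generated output x_g

x = rng.random(256)                 # e.g., a flattened 16x16 face crop
e = En(x)
x_g = De(e)
```

Deepfake architectures exploit exactly this structure: by swapping or modifying e before decoding, the output x_g can be driven toward a different identity or expression.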
Generative Adversarial Networks (GAN). The GAN was first proposed in 2014 by Goodfellow
et al. in [61]. A GAN consists of two neural networks which work against each other: the
generator G and the discriminator D. G creates fake samples x_g with the aim of fooling D,
and D learns to differentiate between real samples (x ∈ X) and fake samples (x_g = G(z), where
z ∼ N). Concretely, there is an adversarial loss used to train D and G respectively:

    L_adv(D) = max [log D(x) + log(1 − D(G(z)))]    (2)
    L_adv(G) = min [log(1 − D(G(z)))]    (3)

This zero-sum game leads to G learning how to generate samples that are indistinguishable
from the original distribution. After training, D is discarded and G is used to generate content.
When applied to imagery, this approach produces photo-realistic images.
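The two adversarial terms in Eqs. (2) and (3) can be evaluated numerically for example discriminator outputs; this is only an illustration of the loss values, not a full training loop:

```python
import numpy as np

# Numeric sketch of Eqs. (2) and (3): the discriminator's objective
# log D(x) + log(1 - D(G(z))) and the generator's objective log(1 - D(G(z))),
# for example (made-up) discriminator scores.

d_real = 0.9   # D(x): discriminator's score on a real sample
d_fake = 0.2   # D(G(z)): discriminator's score on a generated sample

adv_D = np.log(d_real) + np.log(1 - d_fake)  # D maximizes this quantity
adv_G = np.log(1 - d_fake)                   # G minimizes this quantity

# A confident discriminator (d_real -> 1, d_fake -> 0) drives adv_D toward 0,
# while a successful generator pushes d_fake toward 1, driving adv_G downward.
```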
Numerous variations and improvements of GANs have been proposed over the years.
In the creation of deepfakes, there are two popular image translation frameworks which use
the fundamental principles of GANs:
Image-to-Image Translation (pix2pix). The pix2pix framework enables paired translations
from one image domain to another [72]. In pix2pix, G tries to generate the image x_g
given a visual context x_c as an input, and D discriminates between (x, x_c) and (x_g, x_c).
Moreover, G is an ED CNN with skip connections from En to De (called a U-Net), which
enables G to produce high-fidelity imagery by bypassing the compression layers when
needed. Later, pix2pixHD was proposed [170] for generating high-resolution imagery
with better fidelity.
CycleGAN. An improvement of pix2pix which enables image translation through unpaired
training [192]. The network forms a cycle consisting of two GANs used to convert
images from one domain to another, and then back again to ensure consistency with
a cycle-consistency loss (L_cyc).
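The cycle-consistency idea can be sketched with two hypothetical stand-in generators; here they are simple linear maps that happen to invert each other exactly:

```python
import numpy as np

# Cycle-consistency sketch: with stand-in mappings G (domain A -> B) and
# F (domain B -> A), L_cyc penalizes the round trip F(G(x)) for drifting from
# x. The linear maps below are placeholders for the two trained generators.

rng = np.random.default_rng(3)
A = rng.normal(size=(8, 8))
A_inv = np.linalg.inv(A)

def G(x):
    return A @ x        # "translate" domain A -> B

def F(y):
    return A_inv @ y    # "translate" domain B -> A (here, an exact inverse)

def cycle_loss(x):
    # L1 cycle-consistency loss ||F(G(x)) - x||_1
    return np.sum(np.abs(F(G(x)) - x))

x = rng.random(8)
```

Because F exactly inverts G in this toy, the cycle loss is (numerically) zero; real CycleGAN generators only approximate this, and L_cyc is what pushes them toward consistency without paired data.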
Recurrent Neural Networks (RNN). An RNN is a type of neural network that can handle sequential
and variable-length data. The network remembers its internal state after processing x^(i-1) and
can use it to process x^(i), and so on. In deepfake creation, RNNs are often used to handle audio
and sometimes video. More advanced versions of RNNs include long short-term memory
(LSTM) and gated recurrent units (GRU).
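A minimal RNN cell sketch shows how the internal state carries information from x^(i-1) into the processing of x^(i); the weights here are random placeholders:

```python
import numpy as np

# Minimal RNN cell: the hidden state h is the network's memory of the
# sequence x^(1), ..., x^(i-1), threaded into the processing of x^(i).

rng = np.random.default_rng(4)
W_x = rng.normal(scale=0.5, size=(5, 3))   # input-to-hidden weights
W_h = rng.normal(scale=0.5, size=(5, 5))   # hidden-to-hidden weights

def rnn_step(x, h):
    return np.tanh(W_x @ x + W_h @ h)      # new internal state

sequence = [rng.random(3) for _ in range(4)]   # variable-length input
h = np.zeros(5)
for x in sequence:
    h = rnn_step(x, h)                     # state carried through time
```

LSTM and GRU cells add learned gates to this recurrence so the state can be selectively retained or forgotten over long sequences.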
3.4 Feature Representations
Most deepfake architectures use some form of intermediate representation to capture, and sometimes
manipulate, s and t's facial structure, pose, and expression. One way is to use the facial action coding
system (FACS) and measure each of the face's taxonomized action units (AU) [43]. Another way is
to use monocular reconstruction to obtain a 3D morphable model (3DMM) of the head from a 2D
image, where the pose and expression are parameterized by a set of vectors and matrices; the
parameters, or a 3D rendering of the head itself, are then used. Some works use a UV map of the head or body to give
the network a better understanding of the shape's orientation.
Another approach is to use image segmentation to help the network separate the different concepts
(face, hair, etc.). The most common representation is landmarks (a.k.a. key-points), which are a set
of defined positions on the face or body that can be efficiently tracked using open-source CV
libraries. The landmarks are often presented to the networks as a 2D image with Gaussian points
at each landmark. Some works separate the landmarks by channel to make it easier for the network
to identify and associate them. Similarly, facial boundaries and body skeletons can also be used.
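The Gaussian-point landmark image described above can be sketched as follows; the landmark coordinates are hypothetical, and in practice they would come from an open-source CV library:

```python
import numpy as np

# Sketch of the landmark representation: each (row, col) key-point is rendered
# as a Gaussian "point" on a 2D image that can be fed to the generator (one
# channel per landmark in the multi-channel variant).

def landmark_heatmap(landmarks, size=64, sigma=2.0):
    rows, cols = np.mgrid[0:size, 0:size]
    heat = np.zeros((size, size))
    for (r, c) in landmarks:
        heat += np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2 * sigma ** 2))
    return np.clip(heat, 0.0, 1.0)

# Hypothetical landmark positions (e.g., two eyes and a nose tip).
pts = [(20, 20), (20, 44), (40, 32)]
h = landmark_heatmap(pts)
```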
For audio (speech), the most common approach is to split the audio into segments and, for each
segment, measure the Mel-Cepstral Coefficients (MCC), which capture the dominant voice frequencies.
3.5 Deepfake Creation Basics
To generate x_g, reenactment and face swap networks follow some variation of this process (illustrated
in Fig. 5): Pass x through a pipeline that (1) detects and crops the face, (2) extracts intermediate
Fig. 5. The processing pipeline for making reenactment and face swap deepfakes: preprocessing (detect & crop
the driver and/or identity, then extract an intermediate representation such as landmarks/key points, boundaries/
skeleton, a depth map, a UV map, or 3DMM parameters), generation, and postprocessing (blending). Usually
only a subset of these steps is performed.
representations, (3) generates a new face based on some driving signal (e.g., another face), and then
(4) blends the generated face back into the target frame.
In general, there are six approaches to driving an image:
(1) Let a network work directly on the image and perform the mapping itself.
(2) Train an ED network to disentangle the identity from the expression, and then modify/swap
the encodings of the target before passing them through the decoder.
(3) Add an additional encoding (e.g., AU or embedding) before passing it to the decoder.
(4) Convert the intermediate face/body representation to the desired identity/expression before
generation (e.g., transform the boundaries with a secondary network, or render a 3D model
of the target with the desired expression).
(5) Use the optical flow field from subsequent frames in a source video to drive the generator.
(6) Create a composite of the original content (hair, scene, etc.) with a combination of the 3D
rendering, warped image, or generated content, and pass the composite through another
network (such as pix2pix) to refine the realism.
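The four-step pipeline of Section 3.5 can be sketched end-to-end with stub components; every function below is a hypothetical placeholder for a real face detector, representation extractor, trained generator, and blending step, so only the dataflow is meaningful:

```python
import numpy as np

# High-level pipeline sketch: detect & crop -> extract representation ->
# generate -> blend. All four steps are toy stand-ins for real components.

def detect_and_crop(frame):
    return frame[16:48, 16:48]                 # (1) pretend face region

def extract_representation(face):
    return face.mean(axis=-1)                  # (2) stand-in intermediate rep.

def generate_face(rep_t, rep_s):
    # (3) stub generator: blends the two representations into a 3-channel face.
    return np.stack([0.5 * (rep_t + rep_s)] * 3, axis=-1)

def blend(frame, face):
    out = frame.copy()
    out[16:48, 16:48] = face                   # (4) paste generated face back
    return out

target = np.random.random((64, 64, 3))         # frame of t
source = np.random.random((64, 64, 3))         # driving frame of s

rep_t = extract_representation(detect_and_crop(target))
rep_s = extract_representation(detect_and_crop(source))
x_g = blend(target, generate_face(rep_t, rep_s))
```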
3.6 Generalization
A deepfake network may be trained or designed to work with only a specific set of target and source
identities. An identity-agnostic model is sometimes hard to achieve due to correlations learned by
the model between s and t during training.
Let E be some model or process for representing or extracting features from x, and let M be a
trained model for performing replacement or reenactment. We identify three primary categories
in regard to generalization:
one-to-one: A model that uses a specific identity to drive a specific identity: x_g = M_t(E_s(x_s))
many-to-one: A model that uses any identity to drive a specific identity: x_g = M_t(E(x_s))
many-to-many: A model that uses any identity to drive any identity: x_g = M(E_1(x_s), E_2(x_t))
3.7 Challenges
The following are some challenges in creating realistic deepfakes:
Generalization. Generative networks are data-driven, and therefore reflect the training data in their
outputs. This means that generating high-quality images of a specific identity requires a large number
of samples of that identity. Moreover, access to a large dataset of the driver is typically much
easier to obtain than one of the victim. As a result, over the last few years, researchers have worked
hard to minimize the amount of training data required, and to enable the execution of a
trained model on new target and source identities (unseen during training).
Paired Training. One way to train a neural network is to present the desired output to the model for
each given input. This process of data pairing is laborious and sometimes impractical when
training on multiple identities and actions. To avoid this issue, many deepfake networks
either (1) train in a self-supervised manner by using frames selected from the same video of t,
(2) use unpaired networks such as CycleGAN, or (3) utilize the encodings of an ED network.
Identity Leakage. Sometimes the identity of the driver (e.g., s in reenactment) is partially transferred
to x_g. This occurs when training on a single input identity, or when the network is
trained on many identities but data pairing is done with the same identity. Some solutions
proposed by researchers include attention mechanisms, few-shot learning, disentanglement,
boundary conversions, and AdaIN or skip connections to carry the relevant information to
the generator.
Occlusions. Occlusions are where part of x_s or x_t is obstructed by a hand, hair, glasses, or any
other item. Another type of obstruction is the eye and mouth regions, which may be hidden or
dynamically changing. As a result, artifacts appear, such as cropped imagery or inconsistent
facial features. To mitigate this, works such as [121, 128, 145] perform segmentation and
in-painting on the obstructed areas.
Temporal Coherence. Deepfake videos often exhibit obvious artifacts such as flickering
and jitter [164]. This is because most deepfake networks process each frame individually
with no context of the preceding frames. To mitigate this, some researchers either provide
this context to G and D, implement temporal coherence losses, use RNNs, or perform a
combination thereof.
4 REENACTMENT
In this section we present a chronological review of deep learning based reenactment, organized
according to its class of identity generalization. Table 1 provides a summary and systematization
of all the works mentioned in this section. Later, in Section 7, we contrast the various methods and
identify the most significant approaches.
4.1 Expression Reenactment
Expression reenactment turns an identity into a puppet, giving attackers the most flexibility to
achieve their desired impact. Before we review the subject, we note that expression reenactment
has been around long before deepfakes were popularized. In 2003, researchers morphed models of
3D scanned heads [19]. In 2005, it was shown how this can be done without a 3D model [26], and
through warping with matching similar textures [58]. Later, between 2015 and 2018, Thies et al.
demonstrated how 3D parametric models can be used to achieve high-quality and real-time results
with depth sensing and ordinary cameras ([156] and [157, 158]).
Regardless, today deep learning approaches are recognized as the simplest way to generate be-
lievable content. To help the reader understand the networks and follow the text, we provide the
models' network schematics and loss functions in figures 6-8.
4.1.1 One-to-One (Identity to Identity). In 2017, the authors of [176] proposed using a CycleGAN
for facial reenactment, without the need for data pairing. The two domains were video frames of s
and t. However, to avoid artifacts in x_g, the authors note that both domains must share similar
distributions (e.g., poses and expressions).
In 2018, Bansal et al. proposed a generic translation network based on CycleGAN called Recycle-
GAN [15]. Their framework improves temporal coherence and mitigates artifacts by including
Table 1. Summary of Deep Learning Reenactment Models (Body and Face). Columns: reference, year, model, data needed to retrain for a new source (s) and target (t), network counts (Encoders/Decoders/Discriminators/Other), execution inputs (source; target/identity), and output resolution.

One-to-One:
[176]  2017  FT-GAN             | >20 min. video  | >20 min. video  | 2/2/2/0   | portrait; portrait                       | 128x128
[15]   2018  Recycle-GAN        | 5-10 min. video | 5-10 min. video | 4/4/2/0   | portrait; -                              | 512x512
[71]   2018  DeepFaceLab        | 1-3 hr. video   | 1-3 hr. video   | 1/2/1/1   | portrait video; -                        | 512x512
[105]  2019  Liu et al. 2019    | 1-3 hr. video   | 1-3 hr. video   | 4/4/2/1   | upper-body video; -                      | >256x256

Many-to-One:
[152]  2017  Synth. Obama       | None | 17 hr. video    | 0/0/0/1   | audio; portrait video                    | 2048x1024
[89]   2017  ObamaNet           | None | 17 hr. video    | 1/1/1/1   | text; -                                  | 256x256
[83]   2018  Deep Video Portr.  | None | 1-3 min. video  | 1/1/1/0   | portrait video; neural texture           | 1024x1024
[174]  2018  ReenactGAN         | None | 30 min. video   | N/N/N/1   | portrait; portrait                       | 256x256
[169]  2018  Vid2Vid            | None | 3-8 min. video  | 3/3/2/1   | portrait video; -                        | 2048x1024
[162]  2018  MoCoGAN            | None | 1 min. video    | 2/1/2/N   | expression label; identity label         | 64x64
[73]   2018  SD-CGAN            | None | 2 hr. video     | 0/1/1/1   | audio; -                                 | 128x128
[181]  2019  GRN                | None | 3-10 images     | 3/1/0/2   | gaze; 3-10 eye images                    | 64x128
[55]   2019  TETH               | None | 1 hr. video     | 1/1/2/0   | text; portrait video                     | 512x512
[154]  2019  N.V. Puppetry      | None | 2-3 min. video  | 3/2/2/4   | audio; portrait video                    | 512x512
[103]  2019  NRR-HAV            | None | 8 min. video    | 1/1/1/0   | body image; background                   | 512x512
[2]    2019  Deep Video P.C.    | None | 2 min. video    | 0/1/2/2   | body image; -                            | 256x256
[25]   2019  Everybody D. N.    | None | 20 min. video   | 0/2/4/2   | body image; -                            | 256x256
[191]  2019  D. D. Generation   | None | 3 min. video    | 2/2/2/2   | body video; -                            | 512x512
[183]  2019  N. Talking Heads   | None | 1-3 portraits   | 1/2/1/1   | portrait/landmarks; 1-3 portraits        | 256x256
[168]  2019  Few-shot Vid2Vid   | None | 1-10 portraits  | 3/3/2/4   | portrait/body video; 1-10 portr./bodies  | 2048x1024

Many-to-Many:
[143]  2015  Shimba et al.      | None | None | 0/0/0/1   | audio; face database                     | *
[57]   2016  DeepWarp           | None | None | 0/0/0/2   | gaze; eye image                          | >40x50
[16]   2017  CVAE-GAN           | None | None | 1/1/1/1   | latent variables; portrait               | >128x128
[124]  2017  RDFT               | None | None | 1/1/1/0   | portrait; portrait                       | 256x256
[190]  2017  FE-CDAE            | None | None | 1/1/2/0   | portrait; AU label                       | 32x32
[113]  2018  paGAN              | None | None | 1/1/1/1   | portrait; portrait (neutral)             | 512x512
[172]  2018  X2Face             | None | None | 2/2/0/1   | portrait; 1-3 portraits                  | 256x256
[135]  2018  GANnotation        | None | None | 1/1/1/3   | portrait/landmarks; portrait             | 128x128
[127]  2018  GATH               | None | None | 1/1/1/2   | portrait/AUs; portrait                   | 100x100
[141]  2018  FaceID-GAN         | None | None | 1/1/2/1   | portrait; portrait                       | 128x128
[142]  2018  FaceFeat-GAN       | None | None | 1/1/3/4   | latent variables; portrait               | 128x128
[70]   2018  CAPG-GAN           | None | None | 1/1/2/1   | portrait; portrait                       | 128x128
[159]  2018  DR-GAN             | None | None | 1/1/1/0   | pose; 1+ portraits                       | 96x96
[146]  2018  Deformable GAN     | None | None | 1/1/1/0   | body image/landm.; body image            | 256x256
[14]   2018  SHUP               | None | None | 3/3/1/1   | body image; body image/pose              | 256x256
[46]   2018  DPIG               | None | None | 4/2/1/0   | body image; body image                   | 128x64
[117]  2018  Dense Pose Tr.     | None | None | 25/25/1/2 | body image; body image                   | 256x256
[147]  2018  Song et al.        | None | None | 2/1/3/0   | audio; portrait                          | 128x128
[60]   2019  wg-GAN             | None | None | 2/2/3/0   | portrait; portrait                       | 256x256
[121]  2019  FSGAN              | None | None | 1/1/1/1   | portrait/landmarks; portrait             | 256x256
[128]  2019  GANimation         | None | None | 2/2/1/1   | portrait/AUs; portrait                   | 128x128
[160]  2019  ICface             | None | None | 2/2/1/2   | portrait/AUs; portrait                   | 128x128
[185]  2019  FaceSwapNet        | None | None | 4/2/1/0   | portrait/landmarks; portrait/landmarks   | 256x256
[144]  2019  Monkey-Net         | None | None | 3/3/1/0   | portrait/body; portrait/body             | 64x64
[145]  2019  First-Order-Model  | None | None | 3/3/1/1   | portrait/body; portrait/body             | 256x256
[125]  2019  M&T GAN            | None | None | 2/1/2/1   | expression label; portrait               | 64x64
[48]   2019  AF-VAE             | None | None | 2/1/0/1   | portrait/boundaries; portrait            | 256x256
[56]   2019  Fu et al. 2019     | None | None | 3/2/3/4   | portrait/label; portrait                 | 1024x1024
[186]  2019  FusionNet          | None | None | 1/2/3/3   | portrait/landmarks; portrait             | 256x256
[23]   2019  AD-GAN             | None | None | 2/2/2/1   | pose; portrait                           | 128x128
[164]  2019  Speech D. Anm. 1   | None | None | 5/1/2/3   | audio; portrait                          | 96x128
[165]  2019  Speech D. Anm. 2   | None | None | 5/1/3/3   | audio; portrait                          | 96x128
[79]   2019  Speech D. Anm. 3   | None | None | 5/1/3/3   | audio; portrait                          | 96x128
[189]  2019  DAVS               | None | None | 3/1/1/4   | audio/portrait video; portrait           | 256x256
[27]   2019  ATVGnet            | None | None | 1/0/1/5   | audio; portrait video                    | 128x128
[74]   2019  Speech2Vid         | None | None | 3/1/0/2   | audio; portrait video                    | 109x109
[182]  2019  DwNet              | None | None | 2/1/1/3   | body video; body image                   | 256x256
[62]   2019  LW-GAN             | None | None | 3/3/1/2   | body image; body image                   | 256x256
[35]   2019  C-DGPose           | None | None | 2/1/1/0   | body image; body/pose image              | 64x64
[193]  2019  PPAT-PIG           | None | None | 2/1/2/1   | body image; body/pose image              | 256x256
[171]  2020  ImaGINator         | None | None | 1/1/2/0   | expression label; portrait               | 64x64
[65]   2020  MarioNETte         | None | None | 2/2/1/3   | portrait; 1-8 portraits                  | 256x256
[62]   2020  FLNet              | None | None | 1/5/1/1   | portrait; 16 portraits                   | 224x224
next-frame predictor networks for each domain. For facial reenactment, the authors train their network to translate the facial landmarks of x_s into portraits of x_t.
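The recycle idea above can be sketched with toy callables. This is a minimal pure-Python illustration of the loss's structure only, with the assumed simplification that the next-frame predictor sees just the previous translated frame (the paper conditions on several); scalars stand in for frames:

```python
def recycle_loss(frames_x, g_xy, g_yx, p_y):
    """Sketch of a Recycle-GAN-style recycle loss.

    Each frame is mapped into the other domain (g_xy), advanced there by
    the next-frame predictor (p_y), and mapped back (g_yx); the result is
    compared with the true next frame, tying the two domains together
    temporally rather than frame-by-frame.
    """
    loss = 0.0
    for t in range(len(frames_x) - 1):
        y_t = g_xy(frames_x[t])       # translate to domain Y
        y_next = p_y(y_t)             # predict the next frame in Y
        x_back = g_yx(y_next)         # translate back to domain X
        loss += abs(frames_x[t + 1] - x_back)
    return loss
```

With identity translators and a predictor that matches the true dynamics, the loss vanishes; a predictor that ignores motion is penalized.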
4.1.2 Many-to-One (Multiple Identities to a Single Identity). In 2017, the authors of [16] proposed CVAE-GAN, a conditional VAE-GAN where the generator is conditioned on an attribute vector or class label. However, reenactment with CVAE-GAN requires manual attribute morphing by interpolating the latent variables (e.g., between target poses).
Later, in 2018, a large number of source-identity-agnostic models were published, each proposing a different method of decoupling s from t:^5
Facial Boundary Conversion. One approach was to first convert the structure of the source's facial boundaries to that of the target's before passing them through the generator [174]. In their framework 'ReenactGAN', the authors use a CycleGAN to transform the boundary b_s to the target's face shape as b_t before generating x_g with a pix2pix-like generator.
Temporal GANs. To improve the temporal coherence of deepfake videos, the authors of [162] proposed MoCoGAN: a temporal GAN which generates videos while disentangling the motion and content (objects) in the process. Each frame is generated using a target expression label z_c and a motion embedding z_M^(i) for the i-th frame, obtained from a noise-seeded RNN. MoCoGAN uses two discriminators, one for realism (per frame) and one for temporal coherence (on the last T frames).
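The content/motion factorization above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: a leaky running sum stands in for the GRU, and the dimensions are arbitrary.

```python
import random

def sample_video_latents(num_frames, dim=4, seed=0):
    """Sketch of MoCoGAN-style latent sampling.

    One content code z_c is drawn per clip and reused for every frame,
    while a per-frame motion code z_m^(i) is produced by a toy recurrent
    update driven by fresh noise (a stand-in for the noise-seeded RNN).
    """
    rng = random.Random(seed)
    z_c = [rng.gauss(0, 1) for _ in range(dim)]      # shared content code
    h = [0.0] * dim                                  # recurrent state
    motion_codes = []
    for _ in range(num_frames):
        eps = [rng.gauss(0, 1) for _ in range(dim)]  # per-step noise
        h = [0.5 * hi + ei for hi, ei in zip(h, eps)]
        motion_codes.append(list(h))
    # each frame's generator input is the concatenation [z_c ; z_m^(i)]
    return [z_c + z_m for z_m in motion_codes]

latents = sample_video_latents(num_frames=3, dim=4)
```

The key property is visible in the output: the first half of every frame's latent is identical (content), while the second half evolves over time (motion).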
In [169], the authors proposed a framework called Vid2Vid, which is similar to pix2pix but for videos. Vid2Vid considers the temporal aspect by generating each frame based on the last L source and generated frames. The model also considers optical flow to perform next-frame occlusion prediction (due to moving objects). Similar to pix2pixHD, a progressive training strategy is used to generate high-resolution imagery. In their evaluations, the authors demonstrate facial reenactment using the source's facial boundaries. In comparison to MoCoGAN, Vid2Vid is more practical since the deepfake is driven by x_s (e.g., an actor) instead of crafted labels.
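The sliding-window conditioning described above can be sketched generically. This is a pure-Python illustration of the rollout structure only; `g` is a toy callable standing in for the full generator, and frames are scalars:

```python
from collections import deque

def vid2vid_rollout(source_frames, g, window=2):
    """Sketch of Vid2Vid-style sequential generation.

    Each output frame is produced from the last `window` source frames
    and the last `window` previously generated frames, so errors and
    appearance propagate coherently through time.
    """
    past_src = deque(maxlen=window)   # recent source conditioning
    past_gen = deque(maxlen=window)   # recent generated frames
    outputs = []
    for x_s in source_frames:
        past_src.append(x_s)
        x_g = g(list(past_src), list(past_gen))
        past_gen.append(x_g)
        outputs.append(x_g)
    return outputs
```

A generator that simply copies the newest source frame reproduces the input sequence, while one that mixes in its own past output exhibits the temporal dependence.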
The authors of [83] took temporal deepfakes one step further, achieving complete facial reenactment (gaze, blinking, pose, mouth, etc.) with only one minute of training video. Their approach was to extract the source and target's 3D facial models from 2D images using monocular reconstruction, and then for each frame, (1) transfer the facial pose and expression of the source's 3D model to the target's, and (2) produce x_g with a modified pix2pix framework, using the last 11 frames of rendered heads, UV maps, and gaze masks as the input.
4.1.3 Many-to-Many (Multiple IDs to Multiple IDs).
Label Driven Reenactment. The first attempts at identity-agnostic models were made in 2017, where the authors of [124] used a conditional GAN (CGAN) for the task. Their approach was to (1) extract the inner-face regions as (x_t, x_s), and then (2) pass them to an ED to produce x_g, subject to L_1 and L_adv losses. The challenge of using a CGAN was that the training data had to be paired (images of different identities with the same expression).
Going one step further, in [190] the authors reenacted full portraits at low resolutions. Their approach to decoupling the identities was to use a conditional adversarial autoencoder to disentangle the identity from the expression in the latent space. However, their approach is limited to driving x_t with discrete AU expression labels (fixed expressions) that capture x_s. A similar label-based reenactment was presented in the evaluation of StarGAN [29]: an architecture similar to CycleGAN but for N domains (poses, expressions, etc.).
Later, in 2018, the authors of [127] proposed GATH, which can drive x_t using continuous action units (AU) as an input, extracted from x_s. Using continuous AUs enables smoother reenactments over previous approaches [29, 124, 190]. Their generator is an ED network trained on the loss signals from three other networks: (1) a discriminator, (2) an identity classifier, and (3) a pretrained AU estimator. The classifier shares the same hidden weights as the discriminator to disentangle the identity from the expressions.
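The multi-network supervision above amounts to a weighted sum of loss terms. The sketch below is illustrative only: the term names follow the description in the text, but the weights are hypothetical, not the paper's values.

```python
def gath_generator_loss(adv, idy, au, rec,
                        w_adv=1.0, w_id=1.0, w_au=1.0, w_rec=10.0):
    """Sketch of a GATH-style composite generator objective.

    Combines the adversarial realism term (adv), identity-preservation
    term from the classifier (idy), AU-accuracy term from the pretrained
    estimator (au), and a pixel reconstruction term (rec) into one
    weighted scalar that the generator minimizes.
    """
    return w_adv * adv + w_id * idy + w_au * au + w_rec * rec
```

In practice each argument would itself be computed by one of the auxiliary networks on a batch of generated images.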
^5 Although works such as [124] and [190] achieved fully agnostic models (many-to-many) in 2017, their works were on low-resolution or partial faces.
[Fig. 6. Architectural schematics of reenactment networks: ReenactGAN [174], MoCoGAN [162], Vid2Vid [169], Deep Video Portrait [83], GATH [127], GANimation [128], GANnotation [135], FaceID-GAN [141], and FaceFeat-GAN [142]. Black lines indicate prediction flows used during deployment, dashed gray lines indicate dataflows performed during training. Zoom in for more detail.]
[Fig. 7. Architectural schematics of reenactment networks: paGAN [113], X2Face [172], FaceSwapNet [185], FSGAN [121], Fu et al. [56], ICface [160], AF-VAE [48], wg-GAN [60], Motion&Texture-GAN [125], and ImaGINator [171]. Black lines indicate prediction flows used during deployment, dashed gray lines indicate dataflows performed during training. Zoom in for more detail.]
[Fig. 8. Architectural schematics of reenactment networks: Monkey-Net [144], Neural Talking Heads [183], MarioNETte [65], and Liu et al. [105]. Black lines indicate prediction flows used during deployment, dashed gray lines indicate dataflows performed during training. Zoom in for more detail.]
Self-Attention Modeling. Similar to [127], another work called GANimation [128] reenacts faces through AU value inputs estimated from x_s. Their architecture uses an AU-based generator with a self-attention model to handle occlusions and mitigate other artifacts. Furthermore, another network penalizes G with an expression prediction loss, and shares its weights with the discriminator to encourage realistic expressions. Similar to CycleGAN, GANimation uses a cycle consistency loss which eliminates the need for image pairing.
Instead of relying on AU estimations, the authors of [135] propose GANnotation, which uses facial landmark images. Doing so enables the network to learn facial structure directly from the input, but is more susceptible to identity leakage compared to AUs, which are normalized. GANnotation generates x_g based on (x_t, l_s), where l_s is the facial landmarks of x_s. The model uses the same self-attention model as GANimation, but proposes a novel "triple consistency loss" to minimize artifacts in x_g. The loss teaches the network how to deal with intermediate poses/expressions not found in the training set. Given l_s, l_t, and l_z sampled randomly from the same video, the loss is computed as

L_trip = ‖G(x_t, l_s) − G(G(x_t, l_z), l_s)‖_2    (4)
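Equation (4) can be sketched directly. This is a pure-Python illustration of the loss's structure: `g` is a toy callable standing in for the generator, a scalar squared difference stands in for the image L2 norm, and the inputs are scalars rather than images and landmark maps.

```python
def triple_consistency_loss(g, x_t, l_s, l_z):
    """Sketch of GANnotation's triple consistency loss (Eq. 4).

    Driving x_t directly to the source landmarks l_s should give the
    same result as first driving it to an intermediate pose l_z and then
    to l_s; the loss penalizes the discrepancy between the two paths.
    """
    direct = g(x_t, l_s)            # one-step reenactment
    two_step = g(g(x_t, l_z), l_s)  # via an intermediate pose
    return (direct - two_step) ** 2
```

A generator that fully adopts the driving landmarks is perfectly path-independent, so its loss is zero; one that accumulates its input is not.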
3D Parametric Approaches. Concurrent to the work of [83], other works also leveraged 3D parametric facial models to prevent identity leakage in the generation process. In [141], the authors propose FaceID-GAN, which can reenact t at oblique poses and high resolution. Their ED generator is trained in tandem with a 3DMM face model predictor, where the model parameters of x_t are used to transform x_s before being joined with the encoder's embedding. Furthermore, to prevent identity leakage from x_s to x_g, FaceID-GAN incorporates an identification classifier within the adversarial game. The classifier has 2N outputs, where the first N outputs (corresponding to training set identities) are activated if the input is real, and the rest are activated if it's fake.
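The 2N-way labeling scheme above can be sketched as a tiny helper. The function name is hypothetical; in the paper these labels supervise a softmax classifier that plays in the adversarial game.

```python
def faceid_gan_target(identity_idx, is_real, num_ids):
    """Sketch of FaceID-GAN's 2N-class label assignment.

    Real images of training identity i are assigned class i, while
    generated (fake) images of the same identity are assigned class
    N + i, so the classifier must judge realism and identity jointly.
    """
    return identity_idx if is_real else num_ids + identity_idx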
Later, the authors of [141] proposed FaceFeat-GAN, which improves the diversity of the faces while preserving the identity [142]. The approach is to use a set of GANs to learn facial feature distributions as encodings, and then use these generators to create new content with a decoder. Concretely, three encoder/predictor neural networks P, Q, and I are trained on real images to extract feature vectors from portraits. P predicts 3DMM parameters p, Q encodes the image as q, capturing general facial features using feedback from I, and I is an identity classifier trained to predict label y_i. Next, two GANs, seeded with noise vectors, produce p′ and q′, while a third GAN is trained to reconstruct x_t from (p, q, y_i) and x_g from (p′, q′, y_i). To reenact x_t, (1) y_t is predicted using I (even if the identity was previously unseen), (2) z_p and z_q are selected empirically to fit x_s, and (3) the third GAN's generator uses (p′, q′, y_t) to create x_g. Although FaceFeat-GAN improves image diversity, it is less practical than FaceID-GAN since the GAN's input seed z must be selected empirically to fit x_s.
In [113], the authors present paGAN, a method for complete facial reenactment of a 3D avatar, using a single image of the target as input. An expression-neutral image of x_t is used to generate a 3D model which is then driven by x_s. The driven model is used to create inputs for a U-Net generator: the rendered head, its UV map, its depth map, a masked image of x_t for texture, and a 2D mask indicating the gaze of x_s. Although paGAN is very efficient, the final deepfake is 3D rendered, which detracts from the realism.
Using Multi-Modal Sources. In [172] the authors propose X2Face, which can reenact x_t with x_s or some other modality such as audio or a pose vector. X2Face uses two ED networks: an embedding network and a driving network. First, the embedding network encodes 1-3 examples of the target's face to v_t: the optical flow field required to transform x_t to a neutral pose and expression. Next, x_t is interpolated according to v_t, producing x′_t. Finally, the driving network maps x_s to the vector map v_s, crafted to interpolate x′_t to x_g, having the pose and expression of x_s. During training, first an L_1 loss is used between x_t and x_g, and then an identity loss is used between x_s and x_g using a pre-trained identity model trained on the VGG-Face dataset. All interpolation is performed with a TensorFlow interpolation layer to enable backpropagation using x′_t and x_g. The authors also show how the embedding of the driving network can be mapped to other modalities such as audio and pose.
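The sampling step that X2Face's vector maps rely on can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: images are lists of lists, the flow holds per-pixel (dy, dx) offsets, and nearest-neighbour sampling with border clamping stands in for the differentiable bilinear sampler the real model uses.

```python
def warp_image(image, flow):
    """Warp an image by a dense vector map.

    For each output pixel (y, x), copy the source pixel at
    (y + dy, x + dx), rounded and clamped to the image borders.
    A zero flow therefore reproduces the input exactly.
    """
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dy, dx = flow[y][x]
            sy = min(max(int(round(y + dy)), 0), h - 1)
            sx = min(max(int(round(x + dx)), 0), w - 1)
            out[y][x] = image[sy][sx]
    return out
```

For example, a uniform flow of (0, 1) shifts content one pixel to the left, with the border column repeated due to clamping.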
In 2019, nearly all works pursued identity-agnostic models:
Facial Landmark & Boundary Conversion. In [185], the authors propose FaceSwapNet, which tries to mitigate the issue of identity leakage from facial landmarks. First, two encoders and a decoder are used to transfer the expression in landmark l_s to the face structure of l_t, denoted l_g. Then a generator network is used to convert x_t to x_g, where l_g is injected into the network with AdaIN layers like a StyleGAN. The authors found that it is crucial to use a triplet perceptual loss with an external VGG network.
In [56], the authors propose a method for high-resolution reenactment at oblique angles. A set of networks encode the source's pose, expression, and the target's facial boundary for a decoder that generates the reenacted boundary b_g. Finally, an ED network generates x_g using an encoding of x_t's texture in its embedding. A multi-scale loss is used to improve quality, and the authors utilize a small labeled dataset by training their model in a semi-supervised way.
In [121], the authors present FSGAN: a face swapping and facial reenactment model which can handle occlusions. For reenactment, a pix2pixHD generator receives x_t and the source's 3D facial landmarks l_s, represented as a 256x256x70 image (one channel for each of the 70 landmarks). The output is x_g and its segmentation map m_g with three channels (background, face, and hair). The generator is trained recurrently, where each output is passed back as input for several iterations while the landmarks are interpolated incrementally from l_s to l_t. To improve results further, Delaunay triangulation and barycentric coordinate interpolation are used to generate content similar to the target's pose. In contrast to other facial conversion methods [56, 185], FSGAN uses fewer neural networks, enabling real-time reenactment at 30 fps.
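The incremental landmark schedule used in the recurrent training above can be sketched generically. This is a pure-Python illustration of linear interpolation between two landmark sets; the real system applies the generator at each step and also uses barycentric interpolation, which is omitted here.

```python
def interpolate_landmarks(l_s, l_t, steps):
    """Move a set of 2D landmarks from l_s toward l_t in equal steps.

    Returns steps + 1 landmark sets, from l_s (step 0) to l_t (final
    step); each intermediate set would drive one generator iteration.
    """
    out = []
    for i in range(steps + 1):
        a = i / steps
        out.append([((1 - a) * xs + a * xt, (1 - a) * ys + a * yt)
                    for (xs, ys), (xt, yt) in zip(l_s, l_t)])
    return out
```

The endpoints reproduce the inputs exactly, and the midpoint lies halfway between each corresponding landmark pair.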
Latent Space Manipulation. In [160], the authors present a model called ICface, where the expression, pose, mouth, eyes, and eyebrows of x_t can be driven independently. Their architecture is similar to a CycleGAN in that one generator translates x_t into a neutral expression domain as x_t^η, and another generator translates x_t^η into an expression domain as x_g. Both generators are conditioned on the target AU.
In [48] the authors propose an Additive Focal Variational Auto-encoder (AF-VAE) for high-quality reenactment. This is accomplished by separating a C-VAE's latent code into an appearance encoding e_a and an identity-agnostic expression encoding e_x. To capture a wide variety of factors in e_a (e.g., age, illumination, complexion, …), the authors use an additive memory module during training which conditions the latent variables on a Gaussian mixture model, fitted to a clustered set of facial boundaries. Subpixel convolutions were used in the decoder to mitigate artifacts and improve fidelity.
Warp-based Approaches. In the past, facial reenactment was done by warping the image x_t to the landmarks l_s [13]. In [60], the authors propose wg-GAN, which uses the same approach but creates high-fidelity facial expressions by refining the image through a series of GANs: one for refining the warped face and another for in-painting the occlusions (eyes and mouth). A challenge with wg-GAN is that the warping process is sensitive to head motion (change in pose).
In [186], the authors propose a system which can also control the gaze: a decoder generates x_g with an encoding of x_t as the input and a segmentation map of x_s as reenactment guidance via SPADE residual blocks. The authors blend x_g with a warped version, guided by the segmentation, to mitigate artifacts in the background.
To overcome the issue of occlusions in the eyes and mouth, the authors of [62] use multiple images of t as a reference, in contrast to [60] and [186], which only use one. In their approach (FLNet), the model is provided with N samples of t (X_t) having various mouth expressions, along with the landmark deltas between X_t and x_s (L_t). Their model is an ED (configured like GANimation [128]) which produces (1) N encodings for a warped x_g, (2) an appearance encoding, and (3) a selection (weight) encoding. The encodings are then converted into images using separate CNN layers and merged together through masked multiplication. The entire model is trained end-to-end in a self-supervised manner using frames of t taken from different videos.
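The selection-and-merge step above can be sketched as a soft pixel-wise vote over candidates. This is a pure-Python illustration under stated assumptions: images are flat lists of pixel values, one softmax weight per candidate stands in for the per-pixel selection masks FLNet predicts.

```python
import math

def merge_candidates(candidates, weights):
    """Fuse N candidate images with softmax-normalized selection weights.

    Each output pixel is a convex combination of the candidates' pixels,
    so the result stays within the candidates' value range.
    """
    exp_w = [math.exp(w) for w in weights]
    total = sum(exp_w)
    norm = [w / total for w in exp_w]          # weights sum to 1
    n_pix = len(candidates[0])
    return [sum(norm[k] * candidates[k][p] for k in range(len(candidates)))
            for p in range(n_pix)]
```

Equal weights average the candidates, while a strongly dominant weight effectively selects one candidate.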
Motion-Content Disentanglement. In [125] the authors propose a GAN to reenact neutral-expression faces with smooth animations. The authors describe the animations as temporal curves in 2D space, summarized as points on a spherical manifold by calculating their square-root velocity function (SRVF). A WGAN is used to complete this distribution given target expression labels, and a pix2pix GAN is used to convert the sequences of reconstructed landmarks into video frames of the target.
In contrast to MoCoGAN [162], the authors of [171] propose ImaGINator: a conditional GAN which fuses both motion and content and uses transposed 3D convolutions to capture the distinct spatio-temporal relationships. The GAN also uses a temporal discriminator, and to increase diversity, the authors train the temporal discriminator with some videos using the wrong label.
A challenge with works such as [125] and [171] is that they are label-driven and produce videos with a set number of frames. This makes the deepfake creation process manual and less practical. In contrast, the authors of [144] propose Monkey-Net: a self-supervised network for driving an image with an arbitrary video sequence. Similar to MoCoGAN [162], the authors decouple the source's content and motion. First, a series of networks produce a motion heat map (optical flow) using the source and target's key-points, and then an ED generator produces x_g using x_s and the optical flow (in its embedding).
Later, in [145], the authors extend Monkey-Net by improving the object appearance when large pose transformations occur. They accomplish this by (1) modeling motion around the keypoints using affine transformations, (2) updating the key-point loss function accordingly, and (3) having the motion generator predict an occlusion mask on the preceding frame for in-painting inference. Their work has been implemented as a free real-time reenactment tool for video chats, called Avatarify.^6
4.1.4 Few-Shot Learning. Towards the end of 2019 and into the beginning of 2020, researchers began looking into minimizing the amount of training data further via one-shot and few-shot learning.
In [183], the authors propose a few-shot model which works well at oblique angles. To accomplish this, the authors perform meta-transfer learning, where the network is first trained on many different identities and then fine-tuned on the target's identity. Then, an identity encoding of x_t is obtained by averaging the encodings of k sets of (x_t, l_t). Finally, a pix2pix GAN is used to generate x_g using l_s as an input, and the identity encoding via AdaIN layers. Unfortunately, the authors note that their method is sensitive to identity leakage.
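The two ingredients of the few-shot recipe above, averaging k identity encodings and injecting them via AdaIN, can be sketched as follows. This is a minimal pure-Python illustration: embeddings are 1D lists of floats, and AdaIN is shown on a single feature vector rather than per-channel feature maps.

```python
def few_shot_identity(embeddings):
    """Average k per-image identity encodings into one target code."""
    k, d = len(embeddings), len(embeddings[0])
    return [sum(e[i] for e in embeddings) / k for i in range(d)]

def adain(content, style_mean, style_std, eps=1e-5):
    """Adaptive instance normalization on a 1D feature vector.

    Normalize the content features to zero mean / unit variance, then
    re-scale them to the statistics derived from the identity code.
    """
    mu = sum(content) / len(content)
    var = sum((c - mu) ** 2 for c in content) / len(content)
    sd = (var + eps) ** 0.5
    return [style_std * (c - mu) / sd + style_mean for c in content]
```

In the full model, the averaged code is mapped to per-layer (mean, std) pairs, one for each AdaIN layer of the generator.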
In [168] the authors of Vid2Vid (Section 4.1.2) extend their work with few-shot learning. They use a network weight generation module which utilizes an attention mechanism. The module learns to extract appearance patterns from a few samples of x_t, which are injected into the video synthesis layers. In contrast to FLNet [62], the models of [183] and [168] merge the multiple representations of t before passing them through the generator. This approach is more efficient because it involves fewer passes through the model's networks.
In [65], the authors propose MarioNETte, which alleviates identity leakage when the pose of x_s is different than that of x_t. In contrast to other works which encode the identity separately or use AdaIN layers, the authors use an image attention block and target feature alignment. This enables the model to better handle the differences between face structures. Finally, the identity is also preserved using a novel landmark transformer inspired by [21].
4.2 Mouth Reenactment (Dubbing)
In contrast to expression reenactment, mouth reenactment (a.k.a. video or image dubbing) is concerned with driving a target's mouth with a segment of audio. Fig. 9 presents the relevant schematics for this section.
4.2.1 Many-to-One (Multiple Identities to a Single Identity).
Obama Puppetry. In 2017, the authors of [152] created a realistic reenactment of former president Obama. This was accomplished by (1) using a time-delayed RNN over MFCC audio segments to generate a sequence of mouth landmarks (shapes), (2) generating the mouth textures (nose and mouth) by applying a weighted median to images with similar mouth shapes via PCA-space similarity, (3) refining the teeth by transferring the high-frequency details from other frames in the target video, and (4) using dynamic programming to re-time the target video to match the source audio and blend the texture in.
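The time-delay trick in step (1) can be sketched generically: the network's output at time i is treated as the mouth shape for frame i − delay, so each prediction can be informed by a little future audio context. This is a pure-Python illustration with a toy `step` callable standing in for one RNN step; it is not the paper's architecture.

```python
def delayed_outputs(audio_features, step, delay):
    """Run a recurrent step over audio features with a fixed output delay.

    `step` maps (state, input) -> (state, output). The input is padded
    with `delay` zero frames to flush the pipeline, and the first `delay`
    outputs are discarded, so output j corresponds to audio frame j but
    was produced after seeing `delay` frames of future context.
    """
    state, outs = 0.0, []
    padded = audio_features + [0.0] * delay   # flush the pipeline
    for i, a in enumerate(padded):
        state, out = step(state, a)
        if i >= delay:                        # aligned to frame i - delay
            outs.append(out)
    return outs
```

The output sequence has the same length as the input, one mouth-shape prediction per audio frame.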
Later that year, the authors of [89] presented ObamaNet: a network that reenacts an individual's mouth and voice using text as input instead of audio like [152]. The process is to (1) convert the source text to audio using Char2Wav [148], (2) generate a sequence of mouth keypoints using a time-delayed LSTM on the audio, and (3) use a U-Net CNN to perform in-painting on a composite of the target video frame with a masked mouth and overlaid keypoints.
Later, in 2018, Jalalifar et al. [73] proposed a network that synthesizes the entire head portrait of Obama, and therefore does not require pose re-timing and can be trained end-to-end, unlike [152] and [89]. First, a bidirectional LSTM converts MFCC audio segments into a sequence of mouth landmarks, and
^6 https://github.com/alievk/avatarify
[Fig. 9. Architectural schematics for some mouth reenactment networks: Synthesizing Obama [152], TETH [55], SD-CGAN [73], Neural Voice Puppetry [154], ATVGnet [27], DAVS [189], Speech2Vid [74], and Speech Driven Animation [165]. Black lines indicate prediction flows used during deployment, dashed gray lines indicate dataflows performed during training.]
then a pix2pix-like network generates frames using the landmarks and a noise signal. After training,
the pix2pix network is fine-tuned using a single video of the target to ensure consistent textures.
3D Parametric Approaches.
Later on in 2019, the authors of [55] proposed a method for editing a transcript of a talking-head video which, in turn, modifies the target's mouth and speech accordingly. The approach is to (1) align phonemes to a_s, (2) fit a 3D parametric head model to each frame of X_t like [83], (3) blend matching phonemes to create any new audio content, (4) animate the head model with the respective frames used during the blending process, and (5) generate X_g with a CGAN RNN using composites as inputs (rendered mouths placed over the original frame).
The authors of [154] had a different approach: (1) animate the reconstructed 3D head with the blend shape parameters predicted from a_s, using a DeepSpeech model for feature extraction, (2) use Deferred Neural Rendering [155] to generate the mouth region, and then (3) use a network to blend the mouth into the original frame. Compared to previous works, the authors found that their approach only requires 2-3 minutes of video while producing very realistic results. This is because neural rendering can summarize textures with a high fidelity and operate on UV maps, mitigating artifacts in how the textures are mapped to the face.
4.2.2 Many-to-Many (Multiple IDs to Multiple IDs). One of the first works to perform identity agnostic video dubbing was [143]. There the authors used an LSTM to map MFCC audio segments to the face shape. The face shapes were represented as the coefficients of an active appearance model (AAM), which were then used to retrieve the correct face shape of the target.
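The retrieval step above can be sketched as a nearest-neighbor lookup in coefficient space: the predicted AAM coefficients select the closest stored coefficient vector of the target, whose face shape is then used. This is a simplified illustration of the idea, not the exact procedure of [143]; all names and the toy data are hypothetical.

```python
import numpy as np

def retrieve_target_shape(pred_coeffs, target_coeffs, target_shapes):
    """Return the target face shape whose AAM coefficients are
    closest (L2 distance) to the coefficients predicted from the audio."""
    dists = np.linalg.norm(target_coeffs - pred_coeffs, axis=1)
    return target_shapes[np.argmin(dists)]

# Toy bank: 4 candidate shapes, each described by a 3-dim coefficient vector.
coeff_bank = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [5., 5., 5.]])
shape_bank = np.arange(4)  # stand-ins for the stored landmark sets
print(retrieve_target_shape(np.array([0.9, 0.1, 0.0]), coeff_bank, shape_bank))  # → 1
```

In practice the coefficient bank would be built from the target's training footage, so the retrieved shape always lies on the target's own shape manifold.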
Improvements in Lip-sync.
Noting a human's sensitivity to temporal coherence, the authors of [147] use a GAN with three discriminators: on the frames, video, and lip-sync. Frames are generated by (1) encoding each MFCC audio segment a_s^(i) and x_t with separate encoders, (2) passing the encodings through an RNN, and (3) decoding the outputs as x_g^(i) using a decoder.
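These per-frame audio segments are obtained by slicing the waveform into one window per video frame. A minimal sketch of such framing follows; the window length and the centering rule are illustrative assumptions, not taken from [147]:

```python
import numpy as np

def audio_frames(wave, sr, fps, win_ms=160):
    """Slice a waveform into one window per video frame: the i-th window
    is centered on video frame i (window length win_ms is illustrative)."""
    win = int(sr * win_ms / 1000)
    hop = sr // fps                      # audio samples per video frame
    n_frames = len(wave) // hop
    out = []
    for i in range(n_frames):
        c = i * hop + hop // 2           # center sample of frame i
        seg = wave[max(0, c - win // 2): c + win // 2]
        out.append(np.pad(seg, (0, win - len(seg))))  # zero-pad at the edges
    return np.stack(out)

frames = audio_frames(np.zeros(16000), sr=16000, fps=25, win_ms=160)
print(frames.shape)  # → (25, 2560)
```

Each row would then be converted to MFCC features before being fed to the audio encoder.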
In [179] the authors try to improve the lip-syncing with a textual context. A time-delayed LSTM is used to predict mouth landmarks given MFCC segments and the spoken text using a text-to-speech model. The target frames are then converted into sketches using an edge filter and the predicted mouth shapes are composited into them. Finally, a pix2pix-like GAN with self-attention is used to generate the frames with both video and image conditional discriminators.
Compared to direct models such as [147, 179], the authors of [27] improve the lip-syncing by preventing the model from learning irrelevant correlations between the audiovisual signal and the speech content. This was accomplished with an LSTM audio-to-landmark network and a landmark-to-identity CNN-RNN used in sequence. There, the facial landmarks are compressed with PCA and the attention mechanism from [128] is used to help focus the model on the relevant patterns. To improve synchronization further, the authors proposed a regression-based discriminator which considers both sequence and content information.
EDs for Preventing Identity Leakage.
The authors of [189] mitigate identity leakage by disentangling the speech and identity latent spaces using adversarial classifiers. Since their speech encoder is trained to project audio and video into the same latent space, the authors show how x_g can be driven using x_s or a_s.
In [74], the authors propose Speech2Vid which also uses separate encoders for audio and identity. However, to capture the identity better, the identity encoder En_I uses a concatenation of five images of the target, and there are skip connections from En_I to the decoder. To blend the mouth in better, a third 'context' encoder is used to encourage in-painting. Finally, a VDSR CNN is applied to x_g to sharpen the image.
A disadvantage with [189] and [74] is that they cannot control facial expressions and blinking. To resolve this, the authors in [164] generate frames with a strided transposed CNN decoder on GRU-generated noise, in addition to the audio and identity encodings. Their video discriminator uses two RNNs for both the audio and video. When applying the L1 loss, the authors focus on the lower half of the face to encourage better lip-sync quality over facial expressions.
Later in [165], the same authors improve the temporal coherence by splitting the video discriminator into two: (1) for temporal realism in mouth-to-audio synchronization, and (2) for temporal realism in overall facial expressions. Then in [79], the authors tune their approach further by fusing the encodings (audio, identity, and noise) with a polynomial fusion layer as opposed to simply concatenating the encodings together. Doing so makes the network less sensitive to large facial motions compared to [165] and [74].
4.3 Pose Reenactment
Most deep learning works in this domain focus on the problem of face frontalization. However, there
are some works which focus on facial pose reenactment.
In [70] the authors use a U-Net to convert (x_t, l_t, l_s) into x_g using a GAN with two discriminators: one conditioned with the neutral pose image, and the other conditioned with the landmarks. In [159], the authors propose DR-GAN for pose-invariant face recognition. To adjust the pose of x_t, the authors use an ED GAN which encodes x_t as e_t, and then decodes (e_t, p_s, z) as x_g, where p_s is the source's pose vector and z is a noise vector. Compared to [70], [159] has the flexibility of manipulating the encodings for different tasks, and the authors improve the quality of x_g by averaging multiple examples of the identity encoding before passing it through the decoder (similar to [62, 168, 183]). In [23], the authors suggest using two GANs: the first frontalizes the face and produces a UV map, and the second rotates the face, given the target angle as an injected embedding. The result is that each model performs a less complex operation, and therefore the models can collectively produce a higher quality image.
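The encoding-averaging trick used in [159] can be sketched as follows: several identity encodings of the same person are averaged to stabilize the identity code, then concatenated with the pose vector and a noise vector to form the decoder input (e_t, p_s, z). Dimensions and names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_decoder_input(identity_encodings, pose_vec, noise_dim=8):
    """Average several identity encodings of the same person (stabilizes
    the identity code) and concatenate the pose vector and a noise
    vector to form the decoder input (e_t, p_s, z)."""
    e_t = np.mean(identity_encodings, axis=0)   # averaged identity code
    z = rng.standard_normal(noise_dim)          # stochastic variation
    return np.concatenate([e_t, pose_vec, z])

encs = rng.standard_normal((5, 32))             # 5 encodings of the target
dec_in = build_decoder_input(encs, pose_vec=np.array([0.1, -0.3, 0.0]))
print(dec_in.shape)  # → (43,)
```

Averaging suppresses per-image nuisance variation (lighting, expression) while the pose vector alone controls the rendered orientation.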
4.4 Gaze Reenactment
There are only a few deep learning works which have focused on gaze reenactment. In [57] the authors convert a cropped eye x_t, its landmarks, and the source angle, to a flow (vector) field using a 2-scale CNN. x_g is then generated by applying the flow field to x_t, warping it to the source angle. The authors then correct the illumination of x_g with a second CNN. A challenge with [57] is that the head must be frontal to avoid inconsistencies due to pose and perspective. To mitigate this issue, the authors of [181] proposed the Gaze Redirection Network (GRN). In GRN, the target's cropped eye, head pose, and source angle are encoded separately and then passed through an ED network to generate an optical flow field. The field is used to warp x_t into x_g. To overcome the lack of training data and the challenge of data pairing, the authors (1) pre-train their network on 3D synthesized examples, (2) further tune their network on real images, and then (3) fine-tune their network on 3-10 examples of the target.
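The flow-field warping used by both works above can be sketched in a few lines: each output pixel samples the input image at a location offset by the predicted flow. This minimal version uses nearest-neighbor sampling for clarity; the cited systems use differentiable bilinear sampling:

```python
import numpy as np

def warp_with_flow(img, flow):
    """Backward-warp an image with a per-pixel flow field:
    output[y, x] = img[y + flow[y, x, 0], x + flow[y, x, 1]]
    (nearest-neighbor sampling, clipped at image borders)."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    sy = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, h - 1)
    sx = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, w - 1)
    return img[sy, sx]

img = np.arange(16.0).reshape(4, 4)
flow = np.zeros((4, 4, 2)); flow[..., 1] = 1   # shift sampling 1px to the right
print(warp_with_flow(img, flow)[0])  # → [1. 2. 3. 3.]
```

Because warping only rearranges existing pixels, it preserves the eye's texture and identity well, but it cannot synthesize content that is occluded in the input, which is exactly the frontal-pose limitation noted above.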
4.5 Body Reenactment
Several facial reenactment papers from Section 4.1 discuss body reenactment too. For example, Vid2Vid [168, 169], MocoGAN [162], and others [144, 145]. In this section, we focus on methods which specifically target body reenactment. Schematics for some of these architectures can be found in Fig. 10.
4.5.1 One-to-One (Identity to Identity). In the work [105], the authors perform facial reenactment with the upper-body as well (arms and hands). The approach is to (1) use a pix2pixHD GAN to convert the source's facial boundaries to the target's, (2) paste them onto a captured pose skeleton of the source, and (3) use a pix2pixHD GAN to generate x_g from the composite.
4.5.2 Many-to-One (Multiple Identities to a Single Identity).
Dance Reenactment.
In [25] the authors make people dance using a target-specific pix2pixHD GAN with a custom loss function. The generator receives an image of the captured pose skeleton and the discriminator receives the current and last image conditioned on their poses. The quality of the face is then improved with a residual predicted by an additional pix2pixHD GAN, given the face region of the pose. A many-to-one relationship is achieved by normalizing the input pose to that of the target.
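The pose normalization step can be sketched as a scale-and-translate of the source keypoints so that the skeleton matches the target's body height and ankle position. This is a simplified version of the idea; the anchor choice and the toy keypoints are illustrative:

```python
import numpy as np

def normalize_pose(src_kpts, src_h, tgt_h, tgt_ankle_y):
    """Rescale source keypoints (x, y) to the target's body height and shift
    them so the lowest point (ankles) rests at the target's ankle level."""
    scale = tgt_h / src_h
    out = src_kpts * scale
    out[:, 1] += tgt_ankle_y - out[:, 1].max()  # align lowest point (ankles)
    return out

src = np.array([[50., 20.], [50., 120.]])       # head and ankle keypoints
norm = normalize_pose(src, src_h=100., tgt_h=200., tgt_ankle_y=300.)
print(norm)  # → [[100. 100.] [100. 300.]]
```

Without this step, a tall source driving a short target would produce a stretched skeleton that the target-specific generator never saw during training.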
[Fig. 10 schematics omitted: panels for [25] Everybody Dance Now, [103] NRR-HAV, [2] Deep Video Performance Cloning, and [182] DwNet.]
Fig. 10. Architectural schematics for some body reenactment networks. Black lines indicate prediction flows used during deployment, dashed gray lines indicate dataflows performed during training.
The authors of [103] then tried to overcome artifacts which occur in [25], such as stretched limbs due to incorrectly detected pose skeletons. They used photogrammetry software on hundreds of images of the target, and then reenacted the 3D rendering of the target's body. The rendering, partitioned depth map, and background are then passed to a pix2pix model for image generation, using an attention loss.
Another artifact in [25] was that the model could not generalize well to unseen poses. To improve the generalization, the authors of [2] trained their network on many identities other than s and t. First they trained the GAN on paired data (the same identity doing different poses) and then later added another discriminator to evaluate the temporal coherence given (1) x_g^(i) driven by another video, and (2) the optical flow predicted version.
A challenge with the previous works was that they required a lot of training data. This was reduced from about an hour of video footage to only 3 minutes in [191] by segmenting and orienting the limbs of x_t according to x_s before the generation step. Then a pix2pixHD GAN uses this composition and the last k frames' poses to generate the body. Finally, another pix2pixHD GAN is used to blend the body into the background.
4.5.3 Many-to-Many (Multiple IDs to Multiple IDs).
Pose Alignment.
In [146] the authors try to resolve the issue of misalignment when using pix2pix-like architectures. They propose 'deformable skip connections' which help orient the shuffled feature maps according to the source pose. The authors also propose a novel nearest neighbor loss instead of using L1 or L2 losses. To modify unseen identities at test time, an encoding of x_t is passed to the decoder's inner layers.
Although the work of [146] helps align the general images, artifacts can still occur when x_s and x_t have very different poses. To resolve this, the authors of [193] use novel Pose-Attentional Transfer blocks (PATB) inside their GAN-based generator. The architecture passes x_t and the poses p_s concatenated with p_t through separate encoders, which are then passed through a series of PATBs before being decoded. The PATBs progressively transfer regional information of the poses to regions of the image to ultimately create a body that has better shape and appearance consistency.
Pose Warping.
In [117] the authors use a pre-trained DensePose network [9] to refine a predicted pose with a warped and in-painted DensePose UV spatial map of the target. Since the spatial map covers all surfaces of the body, the generated image has improved texture consistency. In contrast to [146, 193] which use feature mappings to alleviate misalignment, the authors of [182] use warping, which reduces the complexity of the network's task. Their model, called DwNet, uses a 'warp module' in an ED network to encode x_t^(i-1) warped to p_s^(i), where p is a UV body map of a pose obtained from a DensePose network.
A challenge with the alignment techniques of the previous works is that the body's 3D shape and limb scales are not considered by the network, resulting in identity leakage from x_s. In [104], the authors counter this issue with their Liquid Warping GAN. This is accomplished by predicting the target's and source's 3D bodies with the model in [77] and then by translating the two through a novel liquid warping block (LWB) in their generator. Specifically, the estimated UV maps of x_s and x_t, along with their calculated transformation flow, are passed through a three-stream generator which produces (1) the background via in-painting, (2) a reconstruction of x_s and its mask for feature mapping, and (3) the reenacted foreground and its mask. The latter two streams use a shared LWB to help the networks address multiple sources (appearance, pose, and identity). The final image is obtained through masked multiplication and the system is trained end-to-end.
Background Foreground Compositing.
In [14], the authors break the process down into three stages, trained end-to-end: (1) use a U-Net to segment x_t's body parts and then orient them according to the source pose p_s, (2) use a second U-Net to generate the body x_g from the composite, and (3) use a third U-Net to perform in-painting on the background and paste x_g into it. The authors of [46] then streamlined this process by using a single ED GAN network to disentangle the foreground appearance (body), background appearance, and pose. Furthermore, by using an ED network, the user gains control over each of these aspects. This is accomplished by segmenting each of these aspects before passing them through encoders. To improve the control over the compositing, the authors of [35] used a CVAE-GAN. This enabled the authors to change the pose and appearance of bodies individually. The approach was to condition the network on heatmaps of the predicted pose and skeleton.
4.5.4 Few-Shot Learning. In [91], the authors demonstrate the few-shot learning technique of [53] on a pix2pixHD network and the network of [14]. Using just a few sample images, they were able to transfer the resemblance of a target to new videos in the wild.
5 REPLACEMENT
The network schematics and summary of works for replacement deepfakes can be found in Fig. 12 and Table 2 respectively.
5.1 Swap
At first, face swapping was a manual process accomplished using tools such as Photoshop. More automated systems first appeared between 2004-08 in [20] and [18]. Later, fully automated methods were proposed in [34, 80, 163] and [122] using methods such as warping and reconstructed 3D morphable face models.
5.1.1 One-to-One (Identity to Identity).
Online Communities.
Aer the Reddit user ‘deepfakes’ was exposed in the media, researchers
[Fig. 11 schematic omitted: the shared encoder En with decoders De_s and De_t, the basic reconstruction losses, and the DeepFaceLab (2019) variant.]
Fig. 11. The basic schematic for the Reddit deepfakes' model and its variants [1, 71, 139].
and online communities began finding improved ways to perform face swapping with deep neural networks. The original deepfake network, published by the Reddit user, is an ED network (visualized in Fig. 11). The architecture consists of one encoder En and two decoders De_s and De_t. The components are trained concurrently as two autoencoders: De_s(En(x_s)) = x̂_s and De_t(En(x_t)) = x̂_t, where x is a cropped face image. As a result, En learns to map s and t to a shared latent space, such that

    De_s(En(x_t)) = x_g        (5)
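The wiring of Eq. (5) can be sketched as follows. The linear maps stand in for the trained convolutional encoder and decoders; this is a schematic illustration of the shared-latent-space architecture, not a working face swapper:

```python
import numpy as np

rng = np.random.default_rng(0)
D, Z = 64, 16                       # flattened face size, latent size

# Stand-ins for the trained networks: one shared encoder, two decoders.
W_en = rng.standard_normal((Z, D)) * 0.1
W_de_s, W_de_t = (rng.standard_normal((D, Z)) * 0.1 for _ in range(2))

En = lambda x: W_en @ x             # shared encoder: maps s and t to one space
De_s = lambda z: W_de_s @ z         # decoder that renders identity s
De_t = lambda z: W_de_t @ z         # decoder that renders identity t

x_t = rng.standard_normal(D)        # a (flattened) face crop of the target
# Training pairs De_s with En on s's faces and De_t with En on t's faces;
# at run time the decoders are swapped to produce the deepfake (Eq. 5):
x_g = De_s(En(x_t))
print(x_g.shape)  # → (64,)
```

Because the encoder is shared, the latent code captures pose and expression while each decoder re-renders them with its own identity's texture, which is what makes the decoder swap produce a face swap.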
Currently, there are a number of open source face swapping tools on GitHub based on the original network. One of the most popular is DeepFaceLab [71]. Their current version offers a wide variety of model configurations, including adversarial training, residual blocks, a style transfer loss, and a masked loss to improve the quality of the face and eyes. To help the network map the target's identity into arbitrary face shapes, the training set is augmented with random face warps.
Another tool called FaceSwap-GAN [139] follows a similar architecture, but uses a denoising autoencoder with self-attention mechanisms, and offers a cycle-consistency loss which can reduce the identity leakage and increase the image fidelity. The decoders in FaceSwap-GAN also generate segmentation masks which help the model handle occlusions and are used to blend x_g back into the target frame. Finally, [1] is another open source tool that provides a GUI. Their software comes with 10 popular implementations, including that of [71], and multiple variations of the original Reddit user's code.
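The mask-based blending step can be sketched as a feathered alpha composite: the generated face replaces the frame where the mask is 1, with a softened boundary to hide the seam. The box-blur feathering here is a simple stand-in for the mask post-processing such tools apply:

```python
import numpy as np

def blend(frame, face, mask, feather=1):
    """Composite a generated face into the target frame with a soft
    (feathered) segmentation mask: out = m*face + (1-m)*frame."""
    m = mask.astype(float)
    for _ in range(feather):            # crude cross-shaped blur feathering
        m = (m + np.roll(m, 1, 0) + np.roll(m, -1, 0)
               + np.roll(m, 1, 1) + np.roll(m, -1, 1)) / 5.0
    return m * face + (1.0 - m) * frame

frame = np.zeros((6, 6)); face = np.ones((6, 6))
mask = np.zeros((6, 6)); mask[2:4, 2:4] = 1
out = blend(frame, face, mask)
print(round(out[3, 3], 2))  # → 0.6
```

Hard (unfeathered) masks produce exactly the sharp blending boundaries that several of the detectors in Section 6 exploit.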
5.1.2 One-to-Many (Single Identity to Multiple Identities).
In [88], the authors use a modified style transfer with a CNN, where the content is x_t and the style is the identity of x_s. The process is (1) align x_t to a reference x_s, (2) transfer the identity of s to the image using a multi-scale CNN, trained with a style loss on images of s, and (3) align the output to x_t and blend the face back in with a segmentation mask.
5.1.3 Many-to-Many (Multiple IDs to Multiple IDs).
One of the first identity agnostic methods was [124], mentioned in Section 4.1.3. However, to train this CGAN, one needs a dataset of paired faces with different identities having the same expression.
Disentanglement with EDs.
To provide more control, in [17] the authors use an ED to disentangle the identity from the attributes (pose, hair, background, and lighting) during the training process. The identity encodings are the last pooling layer of a face classifier, and the attribute encoder is trained using a weighted L2 loss and a KL divergence loss to mitigate identity leakage. The authors also show that they can adjust attributes, expression, and pose via interpolation of the encodings. Instead of swapping identities, the authors of [151] wanted to variably obfuscate the target's identity. To accomplish this, the authors used an ED to predict the 3D head parameters which were either modified or replaced with the source's. Finally a GAN was used to in-paint the face of x_t given the modified head model parameters.
Table 2. Summary of Deep Learning Replacement Models
[Table condensed: the model representation and training columns (3DMM/rendering, segmentation, landmarks/keypoints, labeling of ID/other, pairing, requires video) did not survive extraction and are omitted. En/De/Disc/Other counts the model's encoders, decoders, discriminators, and other networks.]

Ref.   Year  Model              Type          Retraining (s / t)   En/De/Disc/Other  Inputs (s / t)        Resolution
[1]    2017  Deepfakes for All  One-to-One    2k-5k portraits      1/2/0-1/0         - / portrait          256x256
[139]  2018  FaceSwap-GAN       One-to-One    2k-5k portraits      1/2/2/1           - / portrait          256x256
[71]   2018  DeepFaceLab        One-to-One    2k-5k portraits      1/2/0-1/0         - / portrait          256x256
[88]   2017  Fast Face Swap     One-to-Many   60 portraits / None  0/0/0/2           portrait / portrait   256x256
[115]  2018  RSGAN              Many-to-Many  None / None          4/3/2/1           portrait / portrait   128x128
[114]  2018  FSNet              Many-to-Many  None / None          3/4/5/0           portrait / portrait   128x128
[17]   2018  OSIP-FS            Many-to-Many  None / None          2/1/2/0           portrait / portrait   128x128
[112]  2018  DepthNets          Many-to-Many  None / None          3/2/2/1           portrait / portrait   80x80
[121]  2018  FSGAN              Many-to-Many  None / None          4/4/3/1           portrait / portrait   256x256
[151]  2018  IO-FR              Many-to-Many  None / None          1/1/1/1           portrait / portrait   256x256
[140]  2019  FS Face Trans.     Many-to-Many  None / None          1/1/2/2           portraits / portrait  128x128
[175]  2019  IHPT               Many-to-Many  None / None          2/1/2/0           cropped / cropped     128x128
[93]   2019  FaceShifter        Many-to-Many  None / None          3/3/3/0           portrait / portrait   256x256

Disentanglement with VAEs.
In [115], the authors propose RSGAN: a VAE-GAN consisting of two VAEs and a decoder. One VAE encodes the hair region and the other encodes the face region, where both are conditioned on a predicted attribute vector c describing x. Since VAEs are used, the facial attributes can be edited through c.
In contrast to [115], the authors of [114] use a VAE to prepare the content for the generator, and use a network to perform the blending via in-painting. A single VAE-ED network is run on x_s and then x_t, producing encodings for the face of x_s and the landmarks of x_t. To perform a face swap, a generator receives the masked portrait of x_t and performs in-painting on the masked face. The generator uses the landmark encodings in its embedding layer. During training, randomly generated faces are used with a triplet loss on the encodings to preserve identities.
Face Occlusions.
FSGAN [121], mentioned in Section 4.1.3, is also capable of face swapping and can handle occlusions. After the face reenactment generator produces x_r, a second network predicts the target's segmentation mask m_t. Then (x_r^(f), m_t) is passed to a third network that performs in-painting for occlusion correction. Finally a fourth network blends the corrected face into x_t while considering ethnicity and lighting. Instead of using interpolation like [121], the authors of [93] propose FaceShifter, which uses novel Adaptive Attentional Denormalization layers (AAD) to transfer localized feature maps between the faces. In contrast to [121], FaceShifter reduces the number of operations by handling the occlusions through a refinement network trained to consider the delta between the original x_t and a reconstructed x̂_t.
5.1.4 Few-Shot Learning. The same author of FaceSwap-GAN [139] also hosts a few-shot approach online dubbed "One Model to Swap Them All" [140]. In this version the generator receives (x_s^(f), x_t^(f), m_t), where its encoder is conditioned on VGGFace2 features of x_t using FC-AdaIN layers, and its decoder is conditioned on x_t and the face structure m_t via layer concatenations and SPADE-ResBlocks respectively. Two discriminators are used: one on image quality given the face segmentation and the other on the identities.
[Fig. 12 schematics omitted: panels for [139] FaceSwap-GAN, [88] Fast Face Swap, [17] OSIP-FS, [115] RSGAN, [114] FSNet, [93] FaceShifter, [140] Few-Shot Face Translation (losses unknown: training code was not released as of writing), [112] DepthNets, [175] IHPT, and [121] FSGAN.]
Fig. 12. Architectural schematics of the replacement networks with their generation and training dataflows.
5.2 Transfer
Although face transfers precede face swaps, today there are very few works that use deep learning for this task. However, we note that a face transfer is equivalent to performing self-reenactment on a face swapped portrait. Therefore, high quality face transfers can be achieved by combining a method from Section 4.1 and Section 5.1.
In 2018, the authors of [112] proposed DepthNets: an unsupervised network for capturing facial landmarks and translating the pose from one identity to another. The authors use a Siamese network to predict a transformation matrix that maps x_s's 3D facial landmarks to the corresponding 2D landmarks of x_t. A 3D renderer (OpenGL) is then used to warp x_s^(f) to the source pose l_t, and the composition is refined using a CycleGAN. Since warping is involved, the approach is sensitive to occlusions.
Later in 2019, the authors of [175] proposed a self-supervised network which can change the identity of an object within an image. Their ED disentangles the identity from an object's pose using a novel disentanglement loss. Furthermore, to handle misaligned poses, an L1 loss is computed using a pixel-mapped version of x_g to x_s (using the weights of the identity encoder). Similarly, the authors of [100] proposed a method for disentangled identity transfer. However, neither [175] nor [100] were explicitly performed on faces.
6 COUNTERMEASURES
In general, countermeasures to malicious deepfakes can be categorized as either detection or prevention. We will now briefly discuss each accordingly. A summary and systematization of the deepfake detection methods can be found in Table 3.
6.1 Detection
Image forgery detection is a well researched subject [188]. In our review of detection methods, we will focus on works which specifically deal with detecting deepfakes of humans.
6.1.1 Artifact-Specific. Deepfakes often generate artifacts which may be subtle to humans, but can be easily detected using machine learning and forensic analysis. Some works identify deepfakes by searching for specific artifacts. We identify seven types of artifacts: spatial artifacts in blending, environments, and forensics; temporal artifacts in behavior, physiology, synchronization, and coherence.
Blending (spatial).
Some artifacts appear where the generated content is blended back into the frame. To help emphasize these artifacts to a learner, researchers have proposed edge detectors, quality measures, and frequency analysis [4, 8, 42, 111, 187]. In [95] the authors follow a more explicit approach to detecting the boundary. They trained a CNN network to predict an image's blending boundary and a label (real or fake). Instead of using a deepfake dataset, the authors trained their network on a dataset of face swaps generated by splicing similar faces found through facial landmark similarity. By doing so, the model has the advantage that it focuses on the blending boundary and not other artifacts caused by the generative model.
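As a toy illustration of the frequency-analysis idea, one can compare the fraction of spectral energy above a radial cutoff between images: blending seams, noise, and upsampling shift this statistic. The cutoff and the statistic here are illustrative, not taken from the cited works:

```python
import numpy as np

def high_freq_ratio(img, cutoff=0.25):
    """Fraction of 2D spectral energy above a radial frequency cutoff."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    r = np.hypot(ys - h / 2, xs - w / 2) / (min(h, w) / 2)
    return spec[r > cutoff].sum() / spec.sum()

rng = np.random.default_rng(0)
smooth = np.outer(np.sin(np.linspace(0, 3, 64)), np.sin(np.linspace(0, 3, 64)))
noisy = smooth + 0.5 * rng.standard_normal((64, 64))   # high-frequency content
print(high_freq_ratio(smooth) < high_freq_ratio(noisy))  # → True
```

Detectors in this family feed such spectral features (often per-patch, so that the blended face region stands out against the rest of the frame) to a classifier.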
Environment (spatial).
The content of a fake face can be anomalous in context to the rest of the frame. For example, residuals from face warping processes [98, 99, 101], lighting [150], and varying fidelity [86] can indicate the presence of generated content. In [96], the authors follow a different approach by contrasting the generated foreground to the (untampered) background using a patch and pair CNN. The authors of [123] also contrast the fore/background but enable a network to identify the distinguishing features automatically. They accomplish this by (1) encoding the face and context (hair and background) with an ED and (2) passing the difference between the encodings with the complete image (encoded) to a classifier.
Forensics (spatial).
Several works detect deepfakes by analyzing subtle features and patterns left by the model. In [180] and [107], the authors found that GANs leave unique fingerprints and show how it is possible to classify the generator given the content, even in the presence of compression and noise. In [85] the authors analyze a camera's unique sensor noise (PRNU) to detect pasted content. To focus on the residuals, the authors of [108] use a two-stream ED to encode the color image and a frequency enhanced version using "Laplacian of Gaussian layers" (LoG). The two encodings are then fed through an LSTM which then classifies the video based on a sequence of frames.
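A common first step in such forensics is isolating the noise residual by subtracting a smoothed copy of the image; the residual carries the high-frequency fingerprint while suppressing scene content. The box filter below is a simple stand-in for the denoisers used in PRNU work, and the fixed-pattern setup is a toy assumption:

```python
import numpy as np

def noise_residual(img):
    """Image minus a 3x3 box-blurred copy (wrap-around borders): the
    residual retains high-frequency noise patterns such as sensor
    or generator fingerprints."""
    blurred = sum(np.roll(np.roll(img, dy, 0), dx, 1)
                  for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9.0
    return img - blurred

rng = np.random.default_rng(0)
fingerprint = rng.standard_normal((32, 32)) * 0.2   # fixed noise pattern
scene = rng.uniform(size=(32, 32))                  # stand-in scene content
res = noise_residual(scene + fingerprint)
# The residual still correlates with the injected fingerprint:
corr = np.corrcoef(res.ravel(), fingerprint.ravel())[0, 1]
print(corr > 0.3)  # → True
```

In PRNU-based detection, residuals from many frames are averaged and correlated against a camera's reference pattern; pasted regions break that correlation locally.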
Instead of searching for residuals, the authors of [178] search for imperfections and found that deepfakes tend to have inconsistent head poses. Therefore, they detect deepfakes by predicting and monitoring facial landmarks. The authors of [167] had a different approach by training a classifier to focus on the imperfections instead of the residuals. This was accomplished by using a dataset generated using a ProGAN instead of other GANs, since the ProGAN's images contain the least amount of frequency artifacts. In contrast to [167], the authors in [64] use a network to emphasize the residuals and suppress the imperfections in a preprocessing step for a classifier. Their network uses adaptive convolutional layers that predict residuals to maximize the artifacts' influence. Although this approach may help the network identify artifacts better, it may not generalize as well to new types of artifacts.
Behavior (temporal).
With large amounts of data on the target, mannerisms and other behaviors can be monitored for anomalies. For example, in [6] the authors protect world leaders from a wide variety of deepfake attacks by modeling their recorded stock footage. Recently, the authors of [110] showed how behavior can be used with no reference footage of the target. The approach is to detect discrepancies in the perceived emotion extracted from the clip's audio and video content. The authors use a custom Siamese network to consider the audio and video emotions when contrasted to real and fake videos.
Physiology (temporal).
In 2014, researchers hypothesized that generated content will lack physio-
logical signals and identified computer-generated faces by monitoring their heart rate [32]. Regarding
deepfakes, [30] monitored blood volume patterns (pulse) under the skin, and [97] took a more robust
approach by monitoring irregular eye blinking patterns. Instead of detecting deepfakes, the authors
of [31] use the pulse signal to help determine the model used to create the deepfake.
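To illustrate the blinking cue, the regularity of detected blink events can be scored as follows. The coefficient-of-variation score, the assumption that humans blink quasi-periodically (roughly every 2-10 seconds), and the availability of a blink detector are all simplifications, not details taken from the cited work:

```python
import numpy as np

def blink_irregularity(blink_times):
    """Coefficient of variation of inter-blink intervals, given the
    timestamps (in seconds) of detected blinks. Very few blinks, or
    highly erratic spacing, yields a high score."""
    intervals = np.diff(np.sort(np.asarray(blink_times, dtype=float)))
    if len(intervals) < 2:
        return float("inf")  # too few blinks in the clip is itself suspicious
    return float(intervals.std() / intervals.mean())
```

A clip whose score greatly exceeds that of reference footage would then be flagged for closer inspection.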
Synchronization (temporal).
Inconsistencies are also a revealing factor. In [87] and [47], the
authors noticed that video dubbing attacks can be detected by correlating the speech to the landmarks
around the mouth. Later, in [5], the authors refined the approach by detecting when visemes
(mouth shapes) are inconsistent with the spoken phonemes (utterances). In particular, they focus
on phonemes where the mouth must be fully closed (B, P, M), since deepfakes in the wild tend to fail in
generating these visemes.
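The core check of the viseme-phoneme approach can be illustrated in a few lines. The phoneme labels, the landmark-derived mouth-openness measure, and the threshold below are hypothetical stand-ins for the alignment and tracking pipeline used in the cited work:

```python
# Closed-mouth phonemes: visemes for B, P, and M should coincide
# with a (nearly) closed mouth.
CLOSED_MOUTH = {"B", "P", "M"}

def viseme_mismatches(phonemes, mouth_openness, threshold=0.2):
    """Return the frame indices where a closed-mouth phoneme is spoken
    but the mouth (per the landmarks) is clearly open."""
    return [i for i, (p, o) in enumerate(zip(phonemes, mouth_openness))
            if p in CLOSED_MOUTH and o > threshold]
```

A video accumulating many such mismatches would be flagged as a likely dub.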
Coherence (temporal).
As noted in Section 4.1, realistic temporal coherence is challenging to gen-
erate, and some authors capitalize on the resulting artifacts to detect the fake content. For example,
[63] uses an RNN to detect artifacts such as flickers and jitter, and [132] uses an LSTM on the face
region only. In [25], a classifier is trained on pairs of sequential frames, and in [11] the authors refine
the network's focus by monitoring the frames' optical flow. Later, the same authors use an LSTM
to predict the next frame and expose deepfakes when the reconstruction error is high [10].
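The detection logic of the prediction-error approach can be sketched as follows. A naive 'predict the previous frame' model stands in for the cited work's LSTM predictor, and the spike threshold is an illustrative assumption:

```python
import numpy as np

def temporal_anomaly_scores(frames):
    """Per-transition prediction error. Predicting the previous frame is a
    crude stand-in for a learned next-frame predictor."""
    frames = np.asarray(frames, dtype=float)
    return [float(np.mean((frames[t] - frames[t - 1]) ** 2))
            for t in range(1, len(frames))]

def flag_incoherent(frames, z=2.0):
    """Flag transitions whose error exceeds mean + z * std of the error
    sequence (z is an assumed threshold); flickers and jitter in a fake
    video appear as such spikes."""
    s = np.array(temporal_anomaly_scores(frames))
    return [int(i) for i in np.where(s > s.mean() + z * s.std())[0]]
```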
6.1.2 Undirected Approaches. Instead of focusing on a specific artifact, some authors train deep
neural networks as generic classifiers, and let the network decide which features to analyze. In
general, researchers have taken one of two approaches: classification or anomaly detection.
1:28 Mirsky, et al.
Classification.
In [106, 119, 131], it was shown that deep neural networks tend to perform better
than traditional image forensic tools on compressed imagery. Various authors then demonstrated
how standard CNN architectures can effectively detect deepfake videos [3, 38, 39, 153]. In [69], the
authors train the CNN as a Siamese network using contrasting examples of real and fake images.
In [52], the authors were concerned that a CNN can only detect the attacks on which it was trained. To
close this gap, the authors propose a Hierarchical Memory Network (HMN) architecture which
considers the contents of the face and previously seen faces. The network encodes the face region,
which is then processed by a bidirectional GRU with an attention mechanism. The final
encoding is then passed to a memory module, which compares it to recently seen encodings and
makes a prediction. Later, in [129], the authors use an ensemble approach and leverage the predictions
of seven deepfake CNNs by passing their predictions to a meta-classifier. Doing so produces results
which are more robust (fewer false positives) than using any single model. In [36], the authors tried a
variety of classic spatio-temporal networks and feature extractors as a baseline for temporal
deepfake detection. They found that a 3D CNN, which looks at multiple frames at once, outperforms
both recurrent networks and the state-of-the-art I3D architecture.
To localize the tampered areas, some works train networks to predict masks learned from a
ground-truth dataset, or map the neural activations back to the raw image [41, 92, 118, 149].
In general, we note that the use of classifiers to detect deepfakes is problematic, since an attacker
can evade detection via adversarial machine learning. We will discuss this issue further in Section 7.2.
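The stacking idea behind such ensembles can be sketched with a logistic-regression meta-classifier over per-model scores. The base CNNs are assumed to be given (their outputs are simply columns of a score matrix here), and the choice of logistic regression is illustrative rather than a detail of the cited work:

```python
import numpy as np

def train_meta_classifier(model_scores, labels, lr=0.5, steps=2000):
    """Fit a logistic-regression meta-classifier over per-model deepfake
    scores via gradient descent. model_scores: (n_samples, n_models),
    labels: 0 = real, 1 = fake."""
    X = np.hstack([model_scores, np.ones((len(model_scores), 1))])  # bias column
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))        # predicted p(fake)
        w -= lr * X.T @ (p - labels) / len(X)   # gradient of the log-loss
    return w

def meta_predict(model_scores, w):
    X = np.hstack([model_scores, np.ones((len(model_scores), 1))])
    return 1.0 / (1.0 + np.exp(-X @ w))
```

The meta-classifier learns how much to trust each base model, which is what makes the ensemble more robust than any single detector.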
Anomaly Detection.
In contrast to classification, anomaly detection models are trained on the
normal data and then detect outliers during deployment. By doing so, these methods make no
assumptions about how the attacks look and thus generalize better to unknown creation methods.
The authors of [166] follow this approach by measuring the neural activation (coverage) of a face
recognition network. By doing so, the model is able to overcome noise and other distortions by
obtaining a stronger signal than the raw pixels alone provide. Similarly, in [81], a one-class VAE
is trained to reconstruct real images. Then, for each new image, an anomaly score is computed
as the MSE between the mean component of the encoded image and the mean component of the
reconstructed image. Alternatively, the authors of [17] measure an input's embedding distance to
real samples in an ED's latent space. The difference between these works is that [166] and [81] rely
on a model's inability to process unknown patterns, while [17] contrasts the model's representations.
Instead of using a neural network directly, the authors of [51] use a state-of-the-art attribution-
based confidence (ABC) metric. To detect a fake image, the ABC is used to determine whether the image
fits the training distribution of a pretrained face recognition network (e.g., VGG).
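The embedding-distance flavor of these detectors can be illustrated as follows. The embedding network (a face-recognition model in the cited works) is assumed to be given, and the mean-plus-k-standard-deviations threshold is an assumed design choice:

```python
import numpy as np

def anomaly_score(embedding, real_embeddings):
    """Distance to the nearest known-real embedding; a fake whose
    representation falls far from every real sample scores high."""
    d = np.linalg.norm(real_embeddings - embedding, axis=1)
    return float(d.min())

def fit_threshold(real_embeddings, k=3.0):
    """Set the alarm threshold from real data only: mean + k * std of
    leave-one-out nearest-neighbor distances (k is an assumed margin)."""
    scores = []
    for i in range(len(real_embeddings)):
        rest = np.delete(real_embeddings, i, axis=0)
        scores.append(anomaly_score(real_embeddings[i], rest))
    scores = np.array(scores)
    return float(scores.mean() + k * scores.std())
```

Because the threshold is fit on real data alone, no example of any attack is needed at training time.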
6.2 Prevention & Mitigation
Data Provenance.
To prevent deepfakes, some have suggested that the provenance of multimedia
should be tracked through distributed ledgers and blockchain networks [54]. In [44], the authors
suggest that content should be ranked by participants and AI. In contrast, [68] proposes that
content should be authenticated and managed as a global file system over Ethereum smart contracts.
Counter Attacks.
To combat deepfakes, the authors of [102] show how adversarial machine learning
can be used to disrupt and corrupt deepfake networks. The authors use adversarial machine
learning to add crafted noise perturbations to x, which prevents deepfake technologies from locating
a proper face in x. In a different approach, the authors of [138] use adversarial noise to change the
identity of the face so that web crawlers will not be able to find images of t to train their model.
Table 3. Summary of Deepfake Detection Models

(In the original layout, each row also marks the deepfake type (reenactment, replacement), modality
(image, video, audio), analyzed content (feature, body part, face, image, model), the affected area, and
the evaluation datasets: DeepfakeTIMIT [86], DFFD [149], FaceForensics [130], FaceForensics++ [131],
FFW [82], Celeb-DF [101], other deepfake DBs, and custom sets. Performance is reported as ACC, EER,
or AUC.)

Ref    Year  Method            Input Res.  Performance*

Classic ML:
[187]  2017  SVM-RBF           250x250     92.9
[4]    2017  SVM               *           18.2
[178]  2018  SVM               *           0.97
[86]   2018  SVM               128x128     3.33
[42]   2019  SVM, Kmeans...    1024x1024   100
[8]    2019  SVM               *           13.33
[6]    2019  SVM               *           0.98

Deep Learning:
[111]  2018  CNN               256x256     99.4
[97]   2018  LSTM-CNN          224x224     0.99
[119]  2018  Capsule-CNN       128x128     99.3
[17]   2018  ED-GAN            128x128     92
[39]   2018  CNN               1024x1024   0.81
[63]   2018  CNN-LSTM          299x299     97.1
[106]  2018  CNN               256x256     94.4
[33]   2018  CNN AE            256x256     90.5
[3]    2018  CNN               256x256     0.99
[132]  2019  CNN-LSTM          224x224     96.9
[118]  2019  CNN-DE            256x256     92.8, 8.18
[38]   2019  CNN               -           98.5
[41]   2019  CNN AE GAN        256x256     99.2
[149]  2019  CNN+Attention     299x299     3.11, 0.99
[98]   2019  CNN               128x128     0.99
[101]  2019  CNN               *           0.64
[52]   2019  CNN+HMN           224x224     99.4
[92]   2019  FCN               256x256     98.1
[177]  2019  CNN               128x128     94.7
[161]  2019  CNN               224x224     86.4
[153]  2019  CNN               1024x1024   94
[30]   2019  CNN               128x128     96
[99]   2019  CNN               224x224     93.2
[11]   2019  CNN               224x224     81.6
[?]    2019  LSTM              *           22
[47]   2019  LSTM-DNN          *           16.4
[25]   2019  CNN               256x256     97
[180]  2019  CNN               128x128     99.6, 0.53
[166]  2019  SVM+VGGnet        224x224     85
[94]   2019  CNN               64x64       99.2
[95]   2020  HRNet-FCN         64x64       20.86, 0.86
[96]   2020  PP-CNN            -           0.92
[123]  2020  ED-CNN            299x299     0.99
[108]  2020  ED-LSTM           224x224     -
[167]  2020  CNN ResNet        224x224     Avrg. Prec.=0.93
[64]   2020  AREN-CNN          128x128     98.52
[110]  2020  ED-CNN            *           0.92
[5]    2020  CNN               128x128     89.6
[10]   2020  LSTM              256x256     94.29
[69]   2020  Siamese CNN       64x64       TPR=0.91
[129]  2020  Ensemble          224x224     99.65, 1.00
[36]   2020  *                 112x112     98.26, 99.73
[81]   2020  OC-VAE            100x100     TPR=0.89
[51]   2020  ABC-ResNet        224x224     ?

Statistics & Steganalysis:
[85]   2018  PRNU              1280x720    TPR=1, FPR=0.03
[150]  2019  Statistics        -           -
[107]  2019  PRNU              *           90.3

*Only the best reported performance, averaged over the test datasets, is displayed to capture the 'best-case' scenario.
7 DISCUSSION
7.1 The Creation of Deepfakes
7.1.1 Trade-offs Between the Methodologies. In general, there is a different cost and payoff for
each deepfake creation method. However, the most effective and threatening deepfakes are those
which are (1) the most practical to implement [Training Data, Execution Speed, and Accessibility]
and (2) the most believable to the victim [Quality]:
Data vs Quality.
Models trained on numerous samples of the target often yield better results (e.g.,
[25, 55, 71, 73, 89, 105, 152, 174]). For example, in 2017, [152] produced an extremely believable
reenactment of Obama which exceeds the quality of recent works. However, these models require
many hours of footage for training, and are therefore only suitable for exposed targets such as
actors, CEOs, and political leaders. An attacker who wants to commit defamation, impersonation,
or a scam on an arbitrary individual will need to use a many-to-many or few-shot approach. On
the other hand, most of these methods rely on a single reference of t and are therefore prone to
generating artifacts. This is because the model must 'imagine' missing information (e.g., different
poses and occlusions). Therefore, approaches which provide the model with a limited number of
reference samples [62, 65, 159, 168, 172, 181, 183] strike the best balance between data and quality.
Speed vs Quality.
The trade-off between these aspects depends on whether the attack is online
(interactive) or offline (stored media). Social engineering attacks involving deepfakes are likely
to be online and thus require real-time speeds. However, high-resolution models have many
parameters, some use several networks (e.g., [56]), and some process multiple frames to
provide temporal coherence (e.g., [15, 83, 169]). Other methods may be slowed down by their
pre/post-processing steps, such as warping [60, 62, 186], UV mapping or segmentation prediction
[23, 113, 124, 181], and the use of refinement networks [25, 60, 83, 93, 112, 154]. To the best of
our knowledge, [74, 88, 113, 121] and [145] are the only papers which claim to generate real-time
deepfakes, yet they subjectively tend to be blurry or distort the face. Regardless, a victim is likely
to fall for an imperfect deepfake in a social engineering attack when placed under pressure in a false
pretext [173]. Moreover, it is likely that an attacker will implement a complex method at a lower
resolution to speed up the frame rate. In that case, methods that have texture artifacts would be
preferred over those which produce shape or identity flaws (e.g., [145] vs [183]). For attacks that are
not real-time (e.g., fake news), resolution and fidelity are critical. In these cases, works that produce
high-quality images and videos with temporal coherence are the best candidates (e.g., [65, 169]).
Availability vs Quality.
We also note that availability and reproducibility are key factors in the
proliferation of new technologies. Works that publish their code and datasets online (e.g.,
[79, 135, 145, 162, 172, 174]) are more likely to be used by researchers and criminals than
those which are unavailable [2, 55, 65, 83, 121, 127, 154, 171, 185] or require highly specific
or private datasets [57, 113, 181]. This is because the payoff in implementing a paper from scratch
is minor compared to using a functional and effective method available online. Of course, this does not
apply to state actors, who have plenty of time and funding.
We have also observed that approaches which augment a network's inputs with synthetic ones
produce better results in terms of quality and stability; for example, by rotating limbs [105, 191],
refining rendered heads [14, 55, 113, 154, 170, 179], or providing warped imagery [60, 112, 117, 182]
and UV maps [23, 62, 83, 125, 182]. This is because the provided contextual information reduces the
problem's complexity for the neural network.
Given these considerations, in our opinion, the most significant and available deepfake technolo-
gies today are [145] for facial reenactment, because of its efficiency and practicality; [27] for mouth
reenactment, because of its quality; and [71] for face replacement, because of its high fidelity and wide-
spread use. However, this is a subjective opinion based on the samples provided online and in the re-
spective papers. A comparative research study, where the methods are trained on the same dataset and
evaluated by a panel of viewers, is necessary to determine the best-quality deepfake in each category.
7.1.2 Research Trends. Over the last few years there has been a shift towards identity-agnostic mod-
els and high-resolution deepfakes. Some notable advancements include (1) unpaired self-supervised
training techniques to reduce the amount of initial training data, (2) one/few-shot learning which en-
ables identity theft with a single profile picture, (3) improvements to face quality and identity through
AdaIN layers, disentanglement, and pix2pixHD network components, (4) fluid and realistic videos
through temporal discriminators and optical flow prediction, and (5) the mitigation of boundary
artifacts by using secondary networks to blend composites into seamless imagery (e.g., [55, 154, 170]).
Another large advancement in this domain was the use of perceptual loss on a pre-trained VGG
face recognition network. The approach boosts the facial quality significantly and, as a result, has
been adopted in popular online deepfake tools [1, 139]. Another advancement being adopted is the
use of a network pipeline. Instead of enforcing a set of global losses on a single network, a pipeline of
networks is used where each network is tasked with a different responsibility (conversion, generation,
occlusions, blending, etc.). This gives more control over the final output and has mitigated
most of the challenges mentioned in Section 3.7.
7.1.3 Current Limitations. Aside from quality, there are a few limitations to the current deep-
fake technologies. First, for reenactment, content is always driven and generated with a frontal pose.
This limits the reenactment to a very static performance. Today, this is avoided by face swapping the
identity onto a lookalike's body, but a good match is not always possible, and this approach has limited
flexibility. Second, reenactments and replacements depend on the driver's performance to deliver the
identity's personality. We believe that next-generation deepfakes will utilize videos of the target to
stylize the generated content with the expected expressions and mannerisms. This will enable a much
more automatic process of creating believable deepfakes. Finally, a new trend is real-time deepfakes.
Works such as [74, 121] have achieved real-time deepfakes at 30fps. Although real-time deepfakes
are an enabler for phishing attacks, the realism is not quite there yet. Other limitations include the
coherent rendering of hair, teeth, tongues, and shadows, and the ability to render the target's hands
(especially when touching the face). Regardless, deepfakes are already very convincing [131] and
are improving at a rapid rate. Therefore, it is important that we focus on effective countermeasures.
7.2 The Deepfake Arms Race
Like any battle in cyber security, there is an arms race between the attacker and defender. In our survey,
we observed that the majority of deepfake detection algorithms assume a static game with the adversary:
they are either focused on identifying a specific artifact, or do not generalize well to new distributions
and unseen attacks [33]. Moreover, based on the recent benchmark of [101], we observe that the per-
formance of state-of-the-art detectors is decreasing rapidly as the quality of the deepfakes improves.
Concretely, the three most recent benchmark datasets (DFD by Google [120], DFDC by Facebook [40],
and Celeb-DF by [101]) were released within one month of each other at the end of 2019, yet the
deepfake detectors only achieved an AUC of 0.86, 0.76, and 0.66 on them, respectively. Even
a false alarm rate of 0.001 is far too high considering the millions of images published online daily:
at that rate, every million uploads would trigger a thousand false alarms.
Evading Artifact-based Detectors.
To evade an artifact-based detector, the adversary only needs to
mitigate a single flaw. For example, G can generate the biological signals monitored
by [30, 97] by adding a discriminator which monitors these signals. To avoid anomalies in
the neuron activations [166], the adversary can add a loss which minimizes neuron coverage. Methods
which detect abnormal poses and mannerisms [6] can be evaded by reenacting the entire head and by
learning the mannerisms from the same databases. Models which identify blurred content [111] are
affected by noise and sharpening GANs [73, 84], and models which search for the boundary where the
face was blended in [4, 8, 42, 94, 111, 187] do not work on deepfakes passed through refiner networks,
those which use in-painting, or those which output full frames (e.g., [83, 93, 103, 113, 114, 121, 182, 191]).
Finally, solutions which search for forensic evidence [85, 107, 180] can be evaded (or at least made to raise the
false alarm rate) by passing x_g through filters, or by performing physical replication or compression.
Evading Deep Learning Classifiers.
There are a number of detection methods which apply deep
learning directly to the task of deepfake detection (e.g., [3, 38, 39, 52, 153]). However, an adversary can
use adversarial machine learning to evade detection by adding small perturbations to x_g. Advances in
adversarial machine learning have shown that these attacks transfer across multiple models regardless
of the training data used [126]. Recent works have shown that these attacks not only work on
deepfake classifiers [116] but also work with no knowledge of the classifier or its training set [24].
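To make the threat concrete, here is a minimal FGSM-style evasion sketch against a hypothetical linear surrogate detector. The detector's weights, the feature representation, and the epsilon budget are all illustrative assumptions; real attacks target CNN detectors through their gradients in the same spirit:

```python
import numpy as np

def detector(x, w, b):
    """A hypothetical linear surrogate detector: p(fake) = sigmoid(w.x + b)."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def evade(x_fake, w, eps=0.1):
    """FGSM-style evasion: step every feature against the sign of the
    detector's gradient (which, for a linear model, is simply w),
    within an L-infinity budget of eps."""
    return x_fake - eps * np.sign(w)
```

A small, bounded perturbation is enough to push a confidently-detected fake across the decision boundary, which is why classifier-only defenses are fragile.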
Moving Forward.
Nevertheless, deepfakes are still imperfect, and these methods offer a modest
defense for the time being. Furthermore, these works play an important role in understanding the
current limitations of deepfakes, and they raise the difficulty threshold for malicious users. At some point, it
may become too time-consuming and resource-intensive for a common attacker to create a good-enough
fake that evades detection. However, we argue that solely relying on the development of content-based
countermeasures is not sustainable and may lead to a reactive arms race. Therefore, we advocate for
more out-of-band approaches for detecting and preventing deepfakes: for example, the establishment
of content provenance and authenticity frameworks for online videos [44, 54, 68], and proactive
defenses such as the use of adversarial machine learning to protect content from tampering [102].
7.3 Deepfakes in other Domains
In this survey, we put a focus on human reenactment and replacement attacks; the type of deepfake
which has made the largest impact so far [12, 66]. However, deepfakes extend beyond human visuals
and have spread to many other domains. In healthcare, the authors of [109] showed how deepfakes can
be used to inject or remove medical evidence in CT and MRI scans for insurance fraud, disruption, and
physical harm. In [75] it was shown how one's voice can be cloned with only five seconds of audio,
and in Sept. 2019 a CEO was scammed out of $250K via a voice-clone deepfake [37]. The authors
of [22] have shown how deep learning can generate realistic human fingerprints that can unlock
multiple users' devices. In [136] it was shown how deepfakes can be applied to financial records to
evade the detection of auditors. Finally, it has been shown that deepfakes of news articles can be
generated [184] and that deepfake tweets exist as well [50].
These examples demonstrate that deepfakes are not just attack tools for misinformation, defamation,
and propaganda, but also for sabotage, fraud, scams, obstruction of justice, and potentially much more.
7.4 What's on the Horizon
We believe that in the coming years, we will see more deepfakes being weaponized for monetization.
The technology has proven itself in humiliation, misinformation, and defamation attacks. Moreover,
the tools are becoming more practical [1] and efficient [75]. Therefore, it seems natural that malicious
users will find ways to use the technology for a profit. As a result, we expect to see an increase in
deepfake phishing attacks and scams targeting both companies and individuals.
As the technology matures, real-time deepfakes will become increasingly realistic. Therefore, we
can expect that the technology will be used by hacking groups to perform reconnaissance as part
of an APT, and by state actors to perform espionage and sabotage by reenacting officials or family
members.
To keep ahead of the game, we must be proactive and consider the adversary's next step, not
just the weaknesses of the current attacks. We suggest that more work be done on evaluating the
theoretical limits of these attacks. For example, finding a bound on a model's delay can help detect
real-time attacks such as [75], and determining the limits of GANs, as in [7], can help us devise the
appropriate strategies. As mentioned earlier, we recommend further research on solutions which
do not require analyzing the content itself. Moreover, we believe it would be beneficial for future
works to explore the weaknesses and limitations of current deepfake detectors. By identifying and
understanding these vulnerabilities, researchers will be able to develop stronger countermeasures.
8 CONCLUSION
Not all deepfakes are malicious. However, because the technology makes it so easy to create believable
media, malicious users are exploiting it to perform attacks. These attacks are targeting individuals
and causing psychological, political, monetary, and physical harm. As time goes on, we expect to
see these malicious deepfakes spread to many other modalities and industries.
In this survey we focused on reenactment and replacement deepfakes of humans. We provided
a deep review of how these technologies work, the differences between their architectures, and
what is being done to detect them. We hope this information will be helpful to the community in
understanding and preventing malicious deepfakes.
REFERENCES
[1] 2017. deepfakes/faceswap: Deepfakes Software For All. https://github.com/deepfakes/faceswap. (Accessed on 01/27/2020).
[2] Kfir Aberman, Mingyi Shi, Jing Liao, Dani Lischinski, Baoquan Chen, and Daniel Cohen-Or. 2019. Deep Video-Based Performance
Cloning. In Computer Graphics Forum. Wiley Online Library.
[3] Darius Afchar, Vincent Nozick, Junichi Yamagishi, and Isao Echizen. 2018. Mesonet: a compact facial video forgery detection network.
In 2018 IEEE International Workshop on Information Forensics and Security (WIFS). IEEE, 1–7.
[4] Akshay Agarwal, Richa Singh, Mayank Vatsa, and Afzel Noore. 2017. SWAPPED! Digital face presentation attack detection via
weighted local magnitude pattern. In 2017 IEEE International Joint Conference on Biometrics (IJCB). IEEE, 659–665.
[5] Shruti Agarwal, Hany Farid, Ohad Fried, and Maneesh Agrawala. 2020. Detecting Deep-Fake Videos from Phoneme-Viseme
Mismatches. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 660–661.
[6] Shruti Agarwal, Hany Farid, Yuming Gu, Mingming He, Koki Nagano, and Hao Li. 2019. Protecting World Leaders Against Deep Fakes.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 38–45.
[7] Sakshi Agarwal and Lav R Varshney. 2019. Limits of Deepfake Detection: A Robust Estimation Viewpoint. arXiv:1905.03493 (2019).
[8] Zahid Akhtar and Dipankar Dasgupta. [n.d.]. A Comparative Evaluation of Local Feature Descriptors for DeepFakes Detection. ([n. d.]).
[9] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. 2018. Densepose: Dense human pose estimation in the wild. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition.
[10] Irene Amerini and Roberto Caldelli. 2020. Exploiting Prediction Error Inconsistencies through LSTM-based Classifiers to Detect
Deepfake Videos. In Proceedings of the 2020 ACM Workshop on Information Hiding and Multimedia Security. 97–102.
[11] Irene Amerini, Leonardo Galteri, Roberto Caldelli, and Alberto Del Bimbo. 2019. Deepfake Video Detection through Optical Flow
Based CNN. In Proceedings of the IEEE International Conference on Computer Vision Workshops.
[12] Arije Antinori. 2019. Terrorism and DeepFake: from Hybrid Warfare to Post-Truth Warfare in a Hybrid World. In ECIAIR 2019 European
Conference on the Impact of Artificial Intelligence and Robotics. Academic Conferences and publishing limited, 23.
[13] Hadar Averbuch-Elor, Daniel Cohen-Or, Johannes Kopf, and Michael F. Cohen. 2017. Bringing Portraits to Life. ACM Transactions on
Graphics (Proceeding of SIGGRAPH Asia 2017) 36, 6 (2017), 196.
[14] Guha Balakrishnan, Amy Zhao, Adrian V Dalca, Fredo Durand, and John Guttag. 2018. Synthesizing images of humans in unseen
poses. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8340–8348.
[15] Aayush Bansal, Shugao Ma, Deva Ramanan, and Yaser Sheikh. 2018. Recycle-gan: Unsupervised video retargeting. In Proceedings of
the European Conference on Computer Vision (ECCV).
[16] Jianmin Bao, Dong Chen, Fang Wen, Houqiang Li, and Gang Hua. 2017. CVAE-GAN: fine-grained image generation through
asymmetric training. In Proceedings of the IEEE International Conference on Computer Vision. 2745–2754.
[17] Jianmin Bao, Dong Chen, Fang Wen, Houqiang Li, and Gang Hua. 2018. Towards open-set identity preserving face synthesis. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[18] Dmitri Bitouk, Neeraj Kumar, Samreen Dhillon, Peter Belhumeur, and Shree K Nayar. 2008. Face swapping: automatically replacing
faces in photographs. In ACM Transactions on Graphics (TOG), Vol. 27. ACM, 39.
[19] Volker Blanz, Curzio Basso, Tomaso Poggio, and Thomas Vetter. 2003. Reanimating faces in images and video. In Computer Graphics
Forum, Vol. 22. Wiley Online Library, 641–650.
[20] Volker Blanz, Kristina Scherbaum, Thomas Vetter, and Hans-Peter Seidel. 2004. Exchanging faces in images. In Computer Graphics
Forum, Vol. 23. Wiley Online Library, 669–676.
[21] Volker Blanz and Thomas Vetter. 1999. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th annual conference
on Computer graphics and interactive techniques. 187–194.
[22] Philip Bontrager, Aditi Roy, Julian Togelius, Nasir Memon, and Arun Ross. 2018. DeepMasterPrints: Generating masterprints for
dictionary attacks via latent variable evolution. In 9th International Conference on Biometrics Theory, Applications and Systems (BTAS). IEEE.
[23] Jie Cao, Yibo Hu, Bing Yu, Ran He, and Zhenan Sun. 2019. 3D aided duet GANs for multi-view face image synthesis. IEEE Transactions
on Information Forensics and Security 14, 8 (2019), 2028–2042.
[24] Nicholas Carlini and Hany Farid. 2020. Evading Deepfake-Image Detectors with White- and Black-Box Attacks. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 658–659.
[25] Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A Efros. 2019. Everybody dance now. In Proceedings of the IEEE International
Conference on Computer Vision. 5933–5942.
[26] Yao-Jen Chang and Tony Ezzat. 2005. Transferable videorealistic speech animation. In Proceedings of the 2005 ACM SIGGRAPH/
Eurographics symposium on Computer animation. ACM, 143–151.
[27] Lele Chen, Ross K Maddox, Zhiyao Duan, and Chenliang Xu. 2019. Hierarchical cross-modal talking face generation with dynamic
pixel-wise loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7832–7841.
[28] Robert Chesney and Danielle Keats Citron. 2018. Deep fakes: a looming challenge for privacy, democracy, and national security. (2018).
[29] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. 2018. Stargan: Unified generative adversarial
networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition.
[30] Umur Aybars Ciftci and Ilke Demir. 2019. FakeCatcher: Detection of Synthetic Portrait Videos using Biological Signals. arXiv preprint
arXiv:1901.02212 (2019).
[31] Umur Aybars Ciftci, Ilke Demir, and Lijun Yin. 2020. How Do the Hearts of Deep Fakes Beat? Deep Fake Source Detection via
Interpreting Residuals with Biological Signals. arXiv preprint arXiv:2008.11363 (2020).
[32] Valentina Conotter, Ecaterina Bodnari, Giulia Boato, and Hany Farid. 2014. Physiologically-based detection of computer generated
faces in video. In 2014 IEEE International Conference on Image Processing (ICIP). IEEE, 248–252.
[33] Davide Cozzolino, Justus Thies, Andreas Rossler, Christian Riess, Matthias Niessner, and Luisa Verdoliva. 2018. Forensictransfer:
Weakly-supervised domain adaptation for forgery detection. arXiv preprint arXiv:1812.02510 (2018).
[34] Kevin Dale, Kalyan Sunkavalli, Micah K Johnson, Daniel Vlasic, Wojciech Matusik, and Hanspeter Pfister. 2011. Video face replacement.
In ACM Transactions on Graphics (TOG). ACM.
[35] Rodrigo De Bem, Arnab Ghosh, Adnane Boukhayma, alaiyasingam Ajanthan, N Siddharth, and Philip Torr. 2019. A conditional
deep generative model of people in natural images. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE.
[36] Oscar de Lima, Sean Franklin, Shreshtha Basu, Blake Karwoski, and Annet George. 2020. Deepfake Detection using Spatiotemporal
Convolutional Networks. arXiv preprint arXiv:2006.14749 (2020).
[37] Jesse Demiani. 2019. A Voice Deepfake Was Used To Scam A CEO Out Of $243,000 - Forbes. hps://bit.ly/38sXb1I.
[38] Xinyi Ding, Zohreh Raziei, Eric C Larson, Eli V Olinick, Paul Krueger, and Michael Hahsler. 2019. Swapped Face Detection using Deep
Learning and Subjective Assessment. arXiv preprint arXiv:1909.04217 (2019).
[39] Nhu-Tai Do, In-Seop Na, and Soo-Hyung Kim. 2018. Forensics Face Detection From GANs Using Convolutional Neural Network.
[40] Brian Dolhansky, Russ Howes, Ben Paum, Nicole Baram, and Cristian Canton Ferrer. 2019. e Deepfake Detection Challenge (DFDC)
Preview Dataset. arXiv preprint arXiv:1910.08854 (2019).
[41] Mengnan Du, Shiva Pentyala, Yuening Li, and Xia Hu. 2019. Towards Generalizable Forgery Detection with Locality-aware
AutoEncoder. arXiv preprint arXiv:1909.05999 (2019).
[42] Ricard Durall, Margret Keuper, Franz-Josef Pfreundt, and Janis Keuper. 2019. Unmasking DeepFakes with simple Features. arXiv
preprint arXiv:1911.00686 (2019).
[43] P Ekman, W Friesen, and J Hager. 2002. Facial action coding system: Research Nexus. Network Research Information, Salt Lake City, UT
1 (2002).
[44] Chi-Ying Chen et al. 2019. A Trusting News Ecosystem Against Fake News from Humanity and Technology Perspectives. In 2019 19th
International Conference on Computational Science and Its Applications (ICCSA). IEEE, 132–137.
[45] Daniil Kononenko et al. 2017. Photorealistic monocular gaze redirection using machine learning. IEEE transactions on pattern analysis
and machine intelligence 40, 11 (2017), 2696–2710.
[46] Liqian Ma et al. 2018. Disentangled person image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 99–108.
[47] Pavel Korshunov et al. 2019. Tampered Speaker Inconsistency Detection with Phonetically Aware Audio-visual Features. In
International Conference on Machine Learning.
[48] Shengju Qian et al. 2019. Make a Face: Towards Arbitrary High Fidelity Face Manipulation. In Proceedings of the IEEE International
Conference on Computer Vision.
[49] Facebook. 2018. Facing Facts. https://about.fb.com/news/2018/05/inside-feed-facing-facts/#watchnow. (Accessed on 03/02/2020).
[50] Tiziano Fagni, Fabrizio Falchi, Margherita Gambini, Antonio Martella, and Maurizio Tesconi. 2020. TweepFake: about Detecting
Deepfake Tweets. arXiv preprint arXiv:2008.00036 (2020).
[51] Steven Fernandes, Sunny Raj, Rickard Ewetz, Jodh Singh Pannu, Sumit Kumar Jha, Eddy Ortiz, Iustina Vintila, and Margaret Salter.
2020. Detecting Deepfake Videos Using Attribution-Based Confidence Metric. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition Workshops. 308–309.
[52] Tharindu Fernando, Clinton Fookes, Simon Denman, and Sridha Sridharan. 2019. Exploiting Human Social Cognition for the Detection
of Fake and Fraudulent Faces via Memory Networks. arXiv preprint arXiv:1911.07844 (2019).
[53] Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In
Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 1126–1135.
[54] Paula Fraga-Lamas and Tiago M Fernandez-Carames. 2019. Leveraging Distributed Ledger Technologies and Blockchain to Combat
Fake News. arXiv preprint arXiv:1904.05386 (2019).
[55] Ohad Fried, Ayush Tewari, Michael Zollhofer, Adam Finkelstein, Eli Shechtman, Dan B Goldman, Kyle Genova, Zeyu Jin, Christian
Theobalt, and Maneesh Agrawala. 2019. Text-based Editing of Talking-head Video. arXiv preprint arXiv:1906.01524 (2019).
[56] Chaoyou Fu, Yibo Hu, Xiang Wu, Guoli Wang, Qian Zhang, and Ran He. 2019. High Fidelity Face Manipulation with Extreme Pose
and Expression. arXiv preprint arXiv:1903.12003 (2019).
[57] Yaroslav Ganin, Daniil Kononenko, Diana Sungatullina, and Victor Lempitsky. 2016. Deepwarp: Photorealistic image resynthesis for
gaze manipulation. In European conference on computer vision. Springer.
[58] Pablo Garrido, Levi Valgaerts, Ole Rehmsen, Thorsten Thormahlen, Patrick Perez, and Christian Theobalt. 2014. Automatic face
reenactment. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4217–4224.
[59] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. 2015. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576 (2015).
[60] Jiahao Geng, Tianjia Shao, Youyi Zheng, Yanlin Weng, and Kun Zhou. 2019. Warp-guided GANs for single-photo facial animation.
ACM Transactions on Graphics (TOG) 37, 6 (2019), 231.
[61] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio.
2014. Generative adversarial nets. In Advances in neural information processing systems. 2672–2680.
[62] Kuangxiao Gu, Yuqian Zhou, and Thomas S Huang. 2020. FLNet: Landmark Driven Fetching and Learning Network for Faithful
Talking Facial Animation Synthesis.. In AAAI. 10861–10868.
[63] David Guera and Edward J Delp. 2018. Deepfake video detection using recurrent neural networks. In IEEE Conference on Advanced
Video and Signal Based Surveillance (AVSS). IEEE, 1–6.
[64] Zhiqing Guo, Gaobo Yang, Jiyou Chen, and Xingming Sun. 2020. Fake Face Detection via Adaptive Residuals Extraction Network.
arXiv preprint arXiv:2005.04945 (2020).
[65] Sungjoo Ha, Martin Kersner, Beomsu Kim, Seokjun Seo, and Dongyoung Kim. 2020. MarioNETte: Few-shot Face Reenactment
Preserving Identity of Unseen Targets. In Proceedings of the AAAI Conference on Artificial Intelligence.
[66] Holly Kathleen Hall. 2018. Deepfake Videos: When Seeing Isn’t Believing. Cath. UJL & Tech 27 (2018), 51.
[67] Karen Hao. 2019. The biggest threat of deepfakes isn't the deepfakes themselves - MIT Tech Review. https://www.technologyreview.
com/s/614526/the-biggest-threat-of-deepfakes-isnt-the-deepfakes-themselves/.
[68] Haya R Hasan and Khaled Salah. 2019. Combating Deepfake Videos Using Blockchain and Smart Contracts. IEEE Access 7 (2019),
41596–41606.
[69] Chih-Chung Hsu, Yi-Xiu Zhuang, and Chia-Yen Lee. 2020. Deep fake image detection based on pairwise learning. Applied Sciences 10,
1 (2020), 370.
[70] Yibo Hu, Xiang Wu, Bing Yu, Ran He, and Zhenan Sun. 2018. Pose-guided photorealistic face rotation. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 8398–8406.
[71] iperov. 2019. DeepFaceLab: DeepFaceLab is a tool that utilizes machine learning to replace faces in videos. https:
//github.com/iperov/DeepFaceLab. (Accessed on 12/31/2019).
[72] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-image translation with conditional adversarial networks.
In Proceedings of the IEEE conference on computer vision and pattern recognition. 1125–1134.
[73] Seyed Ali Jalalifar, Hosein Hasani, and Hamid Aghajan. 2018. Speech-driven facial reenactment using conditional generative
adversarial networks. arXiv preprint arXiv:1803.07461 (2018).
[74] Amir Jamaludin, Joon Son Chung, and Andrew Zisserman. 2019. You said that?: Synthesising talking faces from audio. International
Journal of Computer Vision (2019), 1–13.
[75] Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui
Wu, et al. 2018. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In Advances in neural information
processing systems. 4480–4490.
[76] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual losses for real-time style transfer and super-resolution. In European
conference on computer vision. Springer, 694–711.
[77] Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. 2018. End-to-end recovery of human shape and pose. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7122–7131.
[78] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2019. Analyzing and improving the image
quality of stylegan. arXiv preprint arXiv:1912.04958 (2019).
[79] Triantafyllos Kefalas, Konstantinos Vougioukas, Yannis Panagakis, Stavros Petridis, Jean Kossaifi, and Maja Pantic. 2019. Speech-driven
facial animation using polynomial fusion of features. arXiv preprint arXiv:1912.05833 (2019).
[80] Ira Kemelmacher-Shlizerman. 2016. Transfiguring portraits. ACM Transactions on Graphics (TOG) 35, 4 (2016), 94.
[81] Hasam Khalid and Simon S Woo. 2020. OC-FakeDect: Classifying Deepfakes Using One-Class Variational Autoencoder. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 656–657.
[82] Ali Khodabakhsh, Raghavendra Ramachandra, Kiran Raja, Pankaj Wasnik, and Christoph Busch. 2018. Fake Face Detection Methods:
Can ey Be Generalized?. In 2018 International Conference of the Biometrics Special Interest Group (BIOSIG). IEEE, 1–6.
[83] Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias Niessner, Patrick Perez, Christian Richardt,
Michael Zollhofer, and Christian Theobalt. 2018. Deep video portraits. ACM Transactions on Graphics (TOG) 37, 4 (2018), 163.
[84] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. 2016. Accurate image super-resolution using very deep convolutional networks. In
Proceedings of the IEEE conference on computer vision and pattern recognition. 1646–1654.
[85] Marissa Koopman, Andrea Macarulla Rodriguez, and Zeno Geradts. 2018. Detection of Deepfake Video Manipulation. In Conference:
IMVIP.
[86] Pavel Korshunov and Sebastien Marcel. 2018. Deepfakes: a new threat to face recognition? assessment and detection. arXiv preprint
arXiv:1812.08685 (2018).
[87] Pavel Korshunov and Sebastien Marcel. 2018. Speaker inconsistency detection in tampered video. In 2018 26th European Signal
Processing Conference (EUSIPCO). IEEE, 2375–2379.
[88] Iryna Korshunova, Wenzhe Shi, Joni Dambre, and Lucas Theis. 2017. Fast face-swap using convolutional neural networks. In
Proceedings of the IEEE International Conference on Computer Vision.
[89] Rithesh Kumar, Jose Sotelo, Kundan Kumar, Alexandre de Brebisson, and Yoshua Bengio. 2017. Obamanet: Photo-realistic lip-sync
from text. arXiv preprint arXiv:1801.01442 (2017).
[90] Dami Lee. 2019. Deepfake Salvador Dalí takes selfies with museum visitors - The Verge. https://bit.ly/3cEim4m.
[91] Jessica Lee, Deva Ramanan, and Rohit Girdhar. 2019. MetaPix: Few-Shot Video Retargeting. arXiv preprint arXiv:1910.04742 (2019).
[92] Jia Li, Tong Shen, Wei Zhang, Hui Ren, Dan Zeng, and Tao Mei. 2019. Zooming into Face Forensics: A Pixel-level Analysis. arXiv
preprint arXiv:1912.05790 (2019).
[93] Lingzhi Li, Jianmin Bao, Hao Yang, Dong Chen, and Fang Wen. 2019. FaceShifter: Towards High Fidelity And Occlusion Aware Face
Swapping. arXiv preprint arXiv:1912.13457 (2019).
[94] Lingzhi Li, Jianmin Bao, Ting Zhang, Hao Yang, Dong Chen, Fang Wen, and Baining Guo. 2019. Face X-ray for More General Face
Forgery Detection. arXiv preprint arXiv:1912.13458 (2019).
[95] Lingzhi Li, Jianmin Bao, Ting Zhang, Hao Yang, Dong Chen, Fang Wen, and Baining Guo. 2020. Face x-ray for more general face
forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5001–5010.
[96] Xurong Li, Kun Yu, Shouling Ji, Yan Wang, Chunming Wu, and Hui Xue. 2020. Fighting Against Deepfake: Patch&Pair Convolutional
Neural Networks (PPCNN). In Companion Proceedings of the Web Conference 2020. 88–89.
[97] Yuezun Li, Ming-Ching Chang, and Siwei Lyu. 2018. In ictu oculi: Exposing ai created fake videos by detecting eye blinking. In 2018
IEEE International Workshop on Information Forensics and Security (WIFS). IEEE, 1–7.
[98] Yuezun Li and Siwei Lyu. 2019. DSP-FWA: Dual Spatial Pyramid for Exposing Face Warp Artifacts in DeepFake Videos.
hps://github.com/danmohaha/DSP-FWA. (Accessed on 12/18/2019).
[99] Yuezun Li and Siwei Lyu. 2019. Exposing DeepFake Videos By Detecting Face Warping Artifacts. In IEEE Conference on Computer
Vision and Paern Recognition Workshops (CVPRW).
[100] Yuheng Li, Krishna Kumar Singh, Utkarsh Ojha, and Yong Jae Lee. 2019. MixNMatch: Multifactor Disentanglement and Encoding for
Conditional Image Generation. arXiv preprint arXiv:1911.11758 (2019).
[101] Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. 2019. Celeb-DF: A New Dataset for DeepFake Forensics. arXiv
preprint arXiv:1909.12962 (2019).
[102] Yuezun Li, Xin Yang, Baoyuan Wu, and Siwei Lyu. 2019. Hiding Faces in Plain Sight: Disrupting AI Face Synthesis with Adversarial
Perturbations. arXiv preprint arXiv:1906.09288 (2019).
[103] Lingjie Liu, Weipeng Xu, Michael Zollhoefer, Hyeongwoo Kim, Florian Bernard, Marc Habermann, Wenping Wang, and Christian
Theobalt. 2019. Neural rendering and reenactment of human actor videos. ACM Transactions on Graphics (TOG) 38, 5 (2019), 139.
[104] Wen Liu, Zhixin Piao, Jie Min, Wenhan Luo, Lin Ma, and Shenghua Gao. 2019. Liquid warping GAN: A unified framework for human
motion imitation, appearance transfer and novel view synthesis. In Proceedings of the IEEE International Conference on Computer Vision.
[105] Zhaoxiang Liu, Huan Hu, Zipeng Wang, Kai Wang, Jinqiang Bai, and Shiguo Lian. 2019. Video synthesis of human upper body with
realistic face. arXiv preprint arXiv:1908.06607 (2019).
[106] Francesco Marra, Diego Gragnaniello, Davide Cozzolino, and Luisa Verdoliva. 2018. Detection of GAN-generated fake images over
social networks. In 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR). IEEE, 384–389.
[107] Francesco Marra, Diego Gragnaniello, Luisa Verdoliva, and Giovanni Poggi. 2019. Do GANs leave artificial fingerprints?. In 2019 IEEE
Conference on Multimedia Information Processing and Retrieval (MIPR). IEEE, 506–511.
[108] Iacopo Masi, Aditya Killekar, Royston Marian Mascarenhas, Shenoy Pratik Guruda, and Wael AbdAlmageed. 2020. Two-branch
Recurrent Network for Isolating Deepfakes in Videos. arXiv preprint arXiv:2008.03412 (2020).
[109] Yisroel Mirsky, Tom Mahler, Ilan Shelef, and Yuval Elovici. 2019. CT-GAN: Malicious Tampering of 3D Medical Imagery using Deep
Learning. In USENIX Security Symposium 2019.
[110] Trisha Mial, Uaran Bhaacharya, Rohan Chandra, Aniket Bera, and Dinesh Manocha. 2020. Emotions Don’t Lie: A Deepfake
Detection Method using Audio-Visual Aective Cues. arXiv preprint arXiv:2003.06711 (2020).
[111] Huaxiao Mo, Bolin Chen, and Weiqi Luo. 2018. Fake faces identification via convolutional neural network. In Proceedings of the 6th
ACM Workshop on Information Hiding and Multimedia Security. ACM.
[112] Joel Ruben Antony Moniz, Christopher Beckham, Simon Rajotte, Sina Honari, and Chris Pal. 2018. Unsupervised depth estimation, 3d
face rotation and replacement. In Advances in Neural Information Processing Systems. 9736–9746.
[113] Koki Nagano, Jaewoo Seo, Jun Xing, Lingyu Wei, Zimo Li, Shunsuke Saito, Aviral Agarwal, Jens Fursund, Hao Li, Richard Roberts, et al.
2018. paGAN: real-time avatars using dynamic textures. ACM Trans. Graph. 37, 6 (2018), 258–1.
[114] Ryota Natsume, Tatsuya Yatagawa, and Shigeo Morishima. 2018. FSNet: An Identity-Aware Generative Model for Image-Based Face
Swapping. In Asian Conference on Computer Vision. Springer, 117–132.
[115] Ryota Natsume, Tatsuya Yatagawa, and Shigeo Morishima. 2018. RSGAN: face swapping and editing using face and hair representation
in latent spaces. arXiv preprint arXiv:1804.03447 (2018).
[116] Paarth Neekhara, Shehzeen Hussain, Malhar Jere, Farinaz Koushanfar, and Julian McAuley. 2020. Adversarial Deepfakes: Evaluating
Vulnerability of Deepfake Detectors to Adversarial Examples. arXiv preprint arXiv:2002.12749 (2020).
[117] Natalia Neverova, Riza Alp Guler, and Iasonas Kokkinos. 2018. Dense pose transfer. In Proceedings of the European conference on
computer vision (ECCV). 123–138.
[118] Huy H Nguyen, Fuming Fang, Junichi Yamagishi, and Isao Echizen. 2019. Multi-task Learning For Detecting and Segmenting
Manipulated Facial Images and Videos. arXiv preprint arXiv:1906.06876 (2019).
[119] Huy H Nguyen, Junichi Yamagishi, and Isao Echizen. 2019. Capsule-forensics: Using capsule networks to detect forged images and
videos. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2307–2311.
[120] Nick Dufour and Andrew Gully. 2019. DFD. https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html.
[121] Yuval Nirkin, Yosi Keller, and Tal Hassner. 2019. FSGAN: Subject Agnostic Face Swapping and Reenactment. In Proceedings of the IEEE
International Conference on Computer Vision. 7184–7193.
[122] Yuval Nirkin, Iacopo Masi, Anh Tran Tuan, Tal Hassner, and Gerard Medioni. 2018. On face segmentation, face swapping, and face
perception. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). IEEE, 98–105.
[123] Yuval Nirkin, Lior Wolf, Yosi Keller, and Tal Hassner. 2020. DeepFake Detection Based on the Discrepancy Between the Face and its
Context. arXiv preprint arXiv:2008.12262 (2020).
[124] Kyle Olszewski, Zimo Li, Chao Yang, Yi Zhou, Ronald Yu, Zeng Huang, Sitao Xiang, Shunsuke Saito, Pushmeet Kohli, and Hao Li. 2017.
Realistic dynamic facial textures from a single image using gans. In Proceedings of the IEEE International Conference on Computer Vision.
5429–5438.
[125] Naima Otberdout, Mohamed Daoudi, Anis Kacem, Lahoucine Ballihi, and Stefano Berretti. 2019. Dynamic Facial Expression Generation
on Hilbert Hypersphere with Conditional Wasserstein Generative Adversarial Nets. arXiv preprint arXiv:1907.10087 (2019).
[126] Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. 2016. Transferability in machine learning: from phenomena to black-box
aacks using adversarial samples. arXiv preprint arXiv:1605.07277 (2016).
[127] Hai X Pham, Yuting Wang, and Vladimir Pavlovic. 2018. Generative adversarial talking head: Bringing portraits to life with a weakly
supervised neural network. arXiv preprint arXiv:1803.07716 (2018).
[128] Albert Pumarola, Antonio Agudo, Aleix M Martinez, Alberto Sanfeliu, and Francesc Moreno-Noguer. 2019. GANimation: One-shot
anatomically consistent facial animation. International Journal of Computer Vision (2019), 1–16.
[129] Md Shohel Rana and Andrew H Sung. 2020. DeepfakeStack: A Deep Ensemble-based Learning Technique for Deepfake Detection.
In 2020 7th IEEE International Conference on Cyber Security and Cloud Computing (CSCloud)/2020 6th IEEE International Conference on
Edge Computing and Scalable Cloud (EdgeCom). IEEE, 70–75.
[130] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Niessner. 2018. Faceforensics: A
large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179 (2018).
[131] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Niessner. 2019. Faceforensics++:
Learning to detect manipulated facial images. arXiv preprint arXiv:1901.08971 (2019).
[132] Ekraam Sabir, Jiaxin Cheng, Ayush Jaiswal, Wael AbdAlmageed, Iacopo Masi, and Prem Natarajan. 2019. Recurrent-Convolution
Approach to DeepFake Detection - State-Of-Art Results on FaceForensics++. arXiv preprint arXiv:1905.00582 (2019).
[133] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training
gans. In Advances in neural information processing systems. 2234–2242.
[134] Sigal Samuel. 2019. A guy made a deepfake app to turn photos of women into nudes. It didn't go well. https://www.vox.com/2019/6/
27/18761639/ai-deepfake-deepnude-app-nude-women-porn.
[135] Enrique Sanchez and Michel Valstar. 2018. Triple consistency loss for pairing distributions in GAN-based face synthesis. arXiv preprint
arXiv:1811.03492 (2018).
[136] Marco Schreyer, Timur Sattarov, Bernd Reimer, and Damian Borth. 2019. Adversarial Learning of Deepfakes in Accounting. arXiv
preprint arXiv:1910.03810 (2019).
[137] Oscar Schwartz. 2018. You thought fake news was bad? The Guardian. https://www.theguardian.com/technology/2018/nov/12/
deep-fakes-fake-news-truth. (Accessed on 03/02/2020).
[138] Shawn Shan, Emily Wenger, Jiayun Zhang, Huiying Li, Haitao Zheng, and Ben Y Zhao. 2020. Fawkes: Protecting Privacy against
Unauthorized Deep Learning Models. In 29th {USENIX} Security Symposium ({USENIX} Security 20). 1589–1604.
[139] shaoanlu. 2018. faceswap-GAN: A denoising autoencoder + adversarial losses and attention mechanisms for face swapping.
https://github.com/shaoanlu/faceswap-GAN. (Accessed on 12/17/2019).
[140] Shaoanlu. 2019. fewshot-face-translation-GAN: Generative adversarial networks integrating modules from FUNIT and SPADE for
face-swapping. https://github.com/shaoanlu/fewshot-face-translation-GAN.
[141] Yujun Shen, Ping Luo, Junjie Yan, Xiaogang Wang, and Xiaoou Tang. 2018. Faceid-gan: Learning a symmetry three-player gan for
identity-preserving face synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 821–830.
[142] Yujun Shen, Bolei Zhou, Ping Luo, and Xiaoou Tang. 2018. FaceFeat-GAN: a Two-Stage Approach for Identity-Preserving Face
Synthesis. arXiv preprint arXiv:1812.01288 (2018).
[143] Taiki Shimba, Ryuhei Sakurai, Hirotake Yamazoe, and Joo-Ho Lee. 2015. Talking heads synthesis from audio with deep neural networks.
In 2015 IEEE/SICE International Symposium on System Integration (SII). IEEE, 100–105.
[144] Aliaksandr Siarohin, Stephane Lathuiliere, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. 2019. Animating arbitrary objects via deep
motion transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2377–2386.
[145] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. 2019. First Order Motion Model
for Image Animation. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 7135–7145.
http://papers.nips.cc/paper/8935-first-order-motion-model-for-image-animation.pdf
[146] Aliaksandr Siarohin, Enver Sangineto, Stephane Lathuiliere, and Nicu Sebe. 2018. Deformable gans for pose-based human image
generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3408–3416.
[147] Yang Song, Jingwen Zhu, Xiaolong Wang, and Hairong Qi. 2018. Talking face generation by conditional recurrent adversarial network.
arXiv preprint arXiv:1804.04786 (2018).
[148] Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, and Yoshua Bengio. 2017. Char2wav:
End-to-end speech synthesis. Openreview.net (2017).
[149] Joel Stehouwer, Hao Dang, Feng Liu, Xiaoming Liu, and Anil Jain. 2019. On the Detection of Digital Face Manipulation. arXiv preprint
arXiv:1910.01717 (2019).
[150] Jeremy Straub. 2019. Using subject face brightness assessment to detect deep fakes (Conference Presentation). In Real-Time Image
Processing and Deep Learning 2019, Vol. 10996. International Society for Optics and Photonics, 109960H.
[151] Qianru Sun, Ayush Tewari, Weipeng Xu, Mario Fritz, Christian Theobalt, and Bernt Schiele. 2018. A hybrid model for identity
obfuscation by face replacement. In Proceedings of the European Conference on Computer Vision (ECCV). 553–569.
[152] Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. 2017. Synthesizing obama: learning lip sync from audio.
ACM Transactions on Graphics (TOG) 36, 4 (2017), 95.
[153] Shahroz Tariq, Sangyup Lee, Hoyoung Kim, Youjin Shin, and Simon S Woo. 2018. Detecting both machine and human created fake
face images in the wild. In Proceedings of the 2nd International Workshop on Multimedia Privacy and Security. ACM, 81–87.
[154] Justus ies, Mohamed Elgharib, Ayush Tewari, Christian eobalt, and Mahias Niessner. 2019. Neural Voice Puppetry: Audio-driven
Facial Reenactment. arXiv preprint arXiv:1912.05566 (2019).
[155] Justus ies, Michael Zollhofer, and Mahias Niessner. 2019. Deferred Neural Rendering: Image Synthesis using Neural Textures.
arXiv preprint arXiv:1904.12356 (2019).
[156] Justus ies, Michael Zollhofer, Mahias Niessner, Levi Valgaerts, Marc Stamminger, and Christian eobalt. 2015. Real-time
expression transfer for facial reenactment. ACM Trans. Graph. 34, 6 (2015), 183–1.
[157] Justus ies, Michael Zollhofer, Marc Stamminger, Christian eobalt, and Mahias Niessner. 2016. Face2face: Real-time face capture
and reenactment of rgb videos. In Proceedings of the IEEE Conference on Computer Vision and Paern Recognition. 2387–2395.
[158] Justus ies, Michael Zollhofer, Christian eobalt, Marc Stamminger, and Mahias Niessner. 2018. Headon: Real-time reenactment
of human portrait videos. ACM Transactions on Graphics (TOG) 37, 4 (2018), 164.
[159] Luan Tran, Xi Yin, and Xiaoming Liu. 2018. Representation learning by rotating your faces. IEEE transactions on pattern analysis and
machine intelligence 41, 12 (2018), 3007–3021.
[160] Soumya Tripathy, Juho Kannala, and Esa Rahtu. 2019. ICface: Interpretable and Controllable Face Reenactment Using GANs. arXiv
preprint arXiv:1904.01909 (2019).
[161] Xiaoguang Tu, Hengsheng Zhang, Mei Xie, Yao Luo, Yuefei Zhang, and Zheng Ma. 2019. Deep Transfer Across Domains for Face
Anti-spoong. arXiv preprint arXiv:1901.05633 (2019).
[162] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. 2018. Mocogan: Decomposing motion and content for video generation.
In Proceedings of the IEEE conference on computer vision and pattern recognition. 1526–1535.
[163] Daniel Vlasic, Matthew Brand, Hanspeter Pfister, and Jovan Popovic. 2006. Face transfer with multilinear models. In ACM SIGGRAPH
2006 Courses. 24–es.
[164] Konstantinos Vougioukas, Stavros Petridis, and Maja Pantic. 2019. End-to-End Speech-Driven Realistic Facial Animation with
Temporal GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 37–40.
[165] Konstantinos Vougioukas, Stavros Petridis, and Maja Pantic. 2019. Realistic Speech-Driven Facial Animation with GANs. arXiv
preprint arXiv:1906.06337 (2019).
[166] Run Wang, Lei Ma, Felix Juefei-Xu, Xiaofei Xie, Jian Wang, and Yang Liu. 2019. Fakespotter: A simple baseline for spotting
ai-synthesized fake faces. arXiv preprint arXiv:1909.06122 (2019).
[167] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. 2020. CNN-generated images are surprisingly easy
to spot… for now. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 7.
[168] Ting-Chun Wang, Ming-Yu Liu, Andrew Tao, Guilin Liu, Jan Kautz, and Bryan Catanzaro. 2019. Few-shot Video-to-Video Synthesis.
In Advances in Neural Information Processing Systems (NeurIPS).
[169] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018. Video-to-Video Synthesis.
In Advances in Neural Information Processing Systems (NeurIPS).
[170] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018. High-resolution image synthesis
and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition.
[171] Yaohui Wang, Piotr Bilinski, Francois Bremond, and Antitza Dantcheva. 2020. ImaGINator: Conditional Spatio-Temporal GAN for
Video Generation.
[172] Olivia Wiles, A Sophia Koepke, and Andrew Zisserman. 2018. X2face: A network for controlling face generation using images, audio,
and pose codes. In Proceedings of the European Conference on Computer Vision (ECCV). 670–686.
[173] Michael Workman. 2008. Wisecrackers: A theory-grounded investigation of phishing and pretext social engineering threats to
information security. Journal of the American Society for Information Science and Technology 59, 4 (2008), 662–674.
[174] Wayne Wu, Yunxuan Zhang, Cheng Li, Chen Qian, and Chen Change Loy. 2018. Reenactgan: Learning to reenact faces via boundary
transfer. In Proceedings of the European Conference on Computer Vision (ECCV). 603–619.
[175] Fanyi Xiao, Haotian Liu, and Yong Jae Lee. 2019. Identity from here, Pose from there: Self-supervised Disentanglement and Generation
of Objects using Unlabeled Videos. In Proceedings of the IEEE International Conference on Computer Vision. 7013–7022.
[176] Runze Xu, Zhiming Zhou, Weinan Zhang, and Yong Yu. 2017. Face transfer with generative adversarial network. arXiv
preprint arXiv:1710.06090 (2017).
[177] Xinsheng Xuan, Bo Peng, Wei Wang, and Jing Dong. 2019. On the generalization of GAN image forensics. In Chinese Conference on
Biometric Recognition. Springer, 134–141.
[178] Xin Yang, Yuezun Li, and Siwei Lyu. 2019. Exposing deep fakes using inconsistent head poses. In ICASSP 2019-2019 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 8261–8265.
[179] Lingyun Yu, Jun Yu, and Qiang Ling. 2019. Mining Audio, Text and Visual Information for Talking Face Generation. In 2019 IEEE
International Conference on Data Mining (ICDM). IEEE, 787–795.
[180] Ning Yu, Larry S Davis, and Mario Fritz. 2019. Attributing fake images to gans: Learning and analyzing gan fingerprints. In Proceedings
of the IEEE International Conference on Computer Vision.
[181] Yu Yu, Gang Liu, and Jean-Marc Odobez. 2019. Improving few-shot user-specific gaze adaptation via gaze redirection synthesis. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 11937–11946.
[182] Polina Zablotskaia, Aliaksandr Siarohin, Bo Zhao, and Leonid Sigal. 2019. DwNet: Dense warp-based network for pose-guided human
video generation. arXiv preprint arXiv:1910.09139 (2019).
[183] Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky. 2019. Few-Shot Adversarial Learning of Realistic Neural
Talking Head Models. arXiv preprint arXiv:1905.08233 (2019).
[184] Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending
Against Neural Fake News. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 9054–9065.
http://papers.nips.cc/paper/9106-defending-against-neural-fake-news.pdf
[185] Jiangning Zhang, Xianfang Zeng, Yusu Pan, Yong Liu, Yu Ding, and Changjie Fan. 2019. FaceSwapNet: Landmark Guided
Many-to-Many Face Reenactment. arXiv preprint arXiv:1905.11805 (2019).
[186] Yunxuan Zhang, Siwei Zhang, Yue He, Cheng Li, Chen Change Loy, and Ziwei Liu. 2019. One-shot Face Reenactment. arXiv preprint
arXiv:1908.03251 (2019).
[187] Ying Zhang, Lilei Zheng, and Vrizlynn LL Thing. 2017. Automated face swapping and its detection. In 2017 IEEE 2nd International
Conference on Signal and Image Processing (ICSIP). IEEE, 15–19.
[188] Lilei Zheng, Ying Zhang, and Vrizlynn LL Thing. 2019. A survey on image tampering and its detection in real-world photos. Journal
of Visual Communication and Image Representation 58 (2019), 380–399.
[189] Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, and Xiaogang Wang. 2019. Talking face generation by adversarially disentangled audio-visual
representation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 9299–9306.
[190] Yuqian Zhou and Bertram Emil Shi. 2017. Photorealistic facial expression synthesis by the conditional difference adversarial
autoencoder. In 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE, 370–376.
[191] Yipin Zhou, Zhaowen Wang, Chen Fang, Trung Bui, and Tamara L Berg. 2019. Dance Dance Generation: Motion Transfer for Internet
Videos. arXiv preprint arXiv:1904.00129 (2019).
[192] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent
adversarial networks. In Proceedings of the IEEE international conference on computer vision. 2223–2232.
[193] Zhen Zhu, Tengteng Huang, Baoguang Shi, Miao Yu, Bofei Wang, and Xiang Bai. 2019. Progressive Pose Attention Transfer for Person
Image Generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.